ADAPTIVE DIMENSION REDUCTION WITH A
GAUSSIAN PROCESS PRIOR 

By Anirban Bhattacharya, Debdeep Pati and David Dunson
Department of Statistical Science, Duke University

In nonparametric regression problems involving multiple predic- 
tors, there is typically interest in estimating the multivariate regres- 
sion surface in the important predictors while discarding the unim- 
portant ones. Our focus is on defining a Bayesian procedure that 
leads to the minimax optimal rate of posterior contraction (up to 
a log factor) adapting to the unknown dimension and anisotropic 
smoothness of the true surface. We propose such an approach based 
on a Gaussian process prior with dimension-specific scalings, which 
are assigned carefully-chosen hyperpriors. We additionally show that 
using a homogeneous Gaussian process with a single bandwidth leads
to a sub-optimal rate in anisotropic cases. 

1. Introduction. Non-parametric function estimation methods have 
been immensely popular due to their ability to adapt to a wide variety 
of function classes with unknown regularities. In Bayesian nonparamet- 
rics, Gaussian processes (Rasmussen, 2004; van der Vaart and van Zanten, 
2008b) are widely used as priors on functions due to tractable posterior
computation and attractive theoretical properties. The law of a mean zero
Gaussian process $W_t$ is entirely characterized by its covariance kernel
$c(s, t) = E(W_s W_t)$. A squared exponential covariance kernel given by
$c(s, t) = \exp(-a \|s - t\|^2)$ is commonly used in the literature.

It is well established (Stone, 1982) that given $n$ independent observations,
the optimal rate of estimation of a $d$-variable function that is only known
to be $\alpha$-smooth is $n^{-\alpha/(2\alpha+d)}$. The quality of estimation
thus improves with increasing smoothness of the "true" function while it
deteriorates with increase in dimensionality. In practice, the smoothness
$\alpha$ is typically unknown and one would thus like to have a unified
estimation procedure that automatically adapts to all possible smoothness
levels of the true function. Accordingly, a lot of effort has been devoted to
developing adaptive estimation methods that are rate-optimal for every
regularity level of the unknown function.



The literature on adaptive estimation in a minimax setting was initiated
by Lepski in a series of papers (Lepski, 1990, 1991, 1992); see also
Birgé (2001) for a discussion on this topic. We also refer the reader to
Hoffmann and Lepski (2002), which contains an extensive list of developments
in the frequentist literature on adaptive estimation. There has been a growing
literature on Bayesian adaptation over the last decade. Previous works
include Belitser and Ghosal (2003); De Jonge and van Zanten (2010);
Ghosal, Lember and van der Vaart (2003, 2008); Huang (2004);
Kruijer, Rousseau and van der Vaart (2010); Rousseau (2010);
Shen and Ghosal (2011).

A key idea in frequentist adaptive estimation is to narrow down the 
search for an "optimal" estimator within a class of estimators indexed by 
a smoothness or bandwidth parameter, and make a data-driven choice to 
select the proper bandwidth. In a Bayesian context, one would place a prior 
on the bandwidth parameter and model-average across different values of 
the bandwidth through the posterior distribution. The parameter $a$ in the
squared-exponential covariance kernel $c$ plays the role of a scaling or
inverse bandwidth. van der Vaart and van Zanten (2009) showed that with a
gamma prior on $a^d$, one obtains the minimax rate of posterior contraction
$n^{-\alpha/(2\alpha+d)}$ up to a logarithmic factor for $\alpha$-smooth
functions adaptively over



scaling variable $a_j$ for the different dimensions incorporates
dimension-specific effects in the covariance kernel, intuitively enabling
better approximation of functions in anisotropic smoothness classes. In
particular, one can let a subset of the covariates drop out of the covariance
kernel by setting some of the scales $a_j$ to zero. Such a model was recently
studied in Savitsky, Vannucci and Sha (2011), who used a point mass mixture
prior on $\rho_j = -\log a_j \in [0, 1]$. Zou et al. (2010) also used a similar
model for high-dimensional non-parametric variable selection. Although this is
an attractive scheme for anisotropic modeling and dimension reduction in
non-parametric regression problems with encouraging empirical performance,
there have not been any theoretical studies of asymptotic properties in
related models in a Bayesian framework.

In the frequentist literature, minimax rates of convergence in anisotropic
Sobolev, Besov and Hölder spaces have been studied in Birgé (1986);
Ibragimov and Khasminski (1981); Nussbaum (1985), with adaptive estimation
procedures developed in Barron, Birgé and Massart (1999);
Hoffmann and Lepski (2002); Kerkyacharian, Lepski and Picard (2001);
Klutchnikoff (2005) among others. The traditional way of dealing with
anisotropy is to employ a separate bandwidth or scaling parameter for the
different dimensions, and choose an optimal combination of scales in a
data-driven way. However, the multidimensional nature of the problem makes
the optimal bandwidth selection difficult compared to the isotropic case, as
there is no natural ordering among the estimators with multiple bandwidths
(Lepski and Levit, 1999).

It is known (Hoffmann and Lepski, 2002) that the minimax rate of convergence
for a function with smoothness $\alpha_i$ along the $i$th dimension is given
by $n^{-\alpha_0/(2\alpha_0+1)}$, where $\alpha_0^{-1} = \sum_{i=1}^d \alpha_i^{-1}$
is an exponent of global smoothness (Birgé, 1986). When $\alpha_i = \alpha$ for
all $i = 1, \ldots, d$, one reduces back to the optimal rate for isotropic
classes. On the contrary, if the true function belongs to an anisotropic
class, the assumption of isotropy would lead to a loss of efficiency which
would be more and more accentuated in higher dimensions. In addition, if the
true function depends on a subset of coordinates
$I = \{i_1, \ldots, i_{d_0}\} \subset \{1, \ldots, d\}$ for some
$1 \le d_0 \le d$, the minimax rate would further improve to
$n^{-\alpha_{0I}/(2\alpha_{0I}+1)}$, with $\alpha_{0I}^{-1} = \sum_{j \in I} \alpha_j^{-1}$.
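
As a quick numerical illustration of these rates, the following minimal Python sketch computes the global smoothness exponent $\alpha_0$, defined through $\alpha_0^{-1} = \sum_i \alpha_i^{-1}$, and the resulting rate exponent $\alpha_0/(2\alpha_0+1)$. The function names `global_smoothness` and `rate_exponent` are hypothetical helpers introduced here for illustration only, and exact rational arithmetic is used merely to make the comparisons transparent.

```python
from fractions import Fraction


def global_smoothness(alphas):
    """Return alpha_0, defined through 1/alpha_0 = sum_i 1/alpha_i."""
    return 1 / sum(Fraction(1) / Fraction(a) for a in alphas)


def rate_exponent(alphas):
    """Return r = alpha_0 / (2*alpha_0 + 1), so that the rate is n**(-r)."""
    a0 = global_smoothness(alphas)
    return a0 / (2 * a0 + 1)


# Isotropic case: alpha_i = 2 in d = 3 dimensions gives alpha / (2*alpha + d).
iso = rate_exponent([2, 2, 2])
# Anisotropic case: one much smoother direction improves the exponent.
aniso = rate_exponent([2, 2, 8])
# Function depending only on two coordinates: the sum runs over I only.
subset = rate_exponent([2, 2])
```

In this example the exponent increases from the isotropic three-dimensional case to the anisotropic case and again to the lower-dimensional case, which is the ordering of rates described in the text.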

The objective of this article is to study whether one can fully adapt to
this larger class of functions in a Bayesian framework using
dimension-specific rescalings of a homogeneous Gaussian process, referred to
as a multi-bandwidth Gaussian process from now on. We answer the question in
the affirmative and develop a class of priors which lead to the optimal rate
$n^{-\alpha_{0I}/(2\alpha_{0I}+1)}$ of posterior contraction (up to a log term)
for any $\boldsymbol{\alpha}$ and $I$ without prior knowledge of either of them.

The general sufficient conditions for obtaining posterior rates of conver- 
gence (Ghosal, Ghosh and van der Vaart, 2000) involve finding a sequence of 
compact and increasing subsets of the parameter space, usually referred to as
sieves, which are "not too large" in the sense of metric entropy and yet capture
most of the prior mass. van der Vaart and van Zanten (2008a) developed a
general technique for constructing such sieves with Gaussian process priors, 
which involved subtle manipulations of the reproducing kernel Hilbert space 
(RKHS) of a Gaussian process (van der Vaart and van Zanten, 2008b). A 
key technical advancement in van der Vaart and van Zanten (2009) was to 
extend the above theoretical framework to the setting of conditionally Gaus- 
sian random fields. In particular, they exploited a containment relation 
among the unit RKHS balls with different bandwidths to construct the 
sieves $B_n$ in their framework. Their construction can be conceptually
related to the general framework for adaptive estimation developed in Lepski
(1990, 1991, 1992), where a natural ordering among kernel estimators with
different scalar bandwidths is utilized to compare different estimators and
balance the bias-variance trade-off. However, it becomes significantly more
complicated to compare kernel estimators with different vectors of bandwidths
in situations involving multiple bandwidths. In multi-bandwidth Gaussian
processes, a similar problem arises in comparing unit RKHS balls of Gaussian
processes with different vectors of bandwidths, and the techniques of
van der Vaart and van Zanten (2009) cannot be immediately extended to 
obtain adaptive posterior contraction rates in this case. 

Our main contribution is to address the above issue by a novel prior
specification on the vector of bandwidths and a careful construction of the
sieves $B_n$, which can be used to establish rate adaptiveness of the posterior
distribution in a variety of settings involving a multi-bandwidth Gaussian
process. For simplicity of exposition, we initially study the problem in two
parts: (i) adaptive estimation over anisotropic Hölder functions of $d$
arguments, and (ii) adaptive estimation over functions that can possibly
depend on fewer coordinates and have isotropic Hölder smoothness over the
remaining coordinates. In each of these cases, we propose a joint prior on the
bandwidths induced through a hierarchical Bayesian framework. To avoid the
problem of comparing between different vectors of scales, we aggregate over a
collection of bandwidth vectors to construct the sets $B_n$. New results are developed to
bound the metric entropy of such collections of unit RKHS balls. Combining 
these results, we balance the metric entropy of the sieve and the prior prob- 
ability of its complement. The prior specifications for the two cases above 
are easy to interpret intuitively and can be easily connected to prescribe a 
unified prior leading to adaptivity over (i) and (ii) combined. In particu- 
lar, our proposed prior has interesting connections to a class of multiplicity 
adjusting priors previously studied by Scott and Berger (2010) in a linear 
model context. 

Although our prior specification involving dimension-specific bandwidth 
parameters leads to adaptivity, a stronger result is required to conclude 
that a single bandwidth would be inadequate for the above classes of func- 
tions. We prove that the optimal prior choice in the isotropic case leads to 
a sub-optimal convergence rate if the true function depends on fewer coor- 
dinates by obtaining a lower bound on the posterior contraction rate. The 
general sufficient conditions for rates of posterior contraction provide an up- 
per bound on the rate of convergence implying that the posterior contracts 
at least as fast as the rate obtained. Castillo (2008) studied lower bounds 
for posterior contraction rate with a class of Gaussian process priors. We 
extend the results of Castillo (2008) to the setting of rescaled Gaussian
process priors. We develop a technique for deriving a sharp lower bound to the
concentration function of a rescaled Gaussian process, which can be used 
for comparing the posterior convergence rates obtained for different prior 
distributions on the bandwidth parameter. 

The remainder of the paper is organized as follows. In Section 2, we introduce
relevant notations. Section 3 discusses the main developments with appli- 
cations to anisotropic Gaussian process mean regression and logistic Gaus- 
sian process density estimation described in subsection 3.4. In Section 4, we 
study various properties of multi-bandwidth Gaussian processes which are 
crucially used in the proofs of the main theorems in Section 5 and should 
also be of independent interest. Section 6 establishes the necessity of the 
multi-bandwidth Gaussian process (GP) by showing that a single rescaling 
can lead to sub-optimal rates when the true function is lower-dimensional. 

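2. Notations. To keep the notation clean, we shall only use boldface
for $\mathbf{a}$, $\mathbf{b}$ and $\boldsymbol{\alpha}$ to denote vectors.

We shall make frequent use of the following multi-index notations. For
vectors $\mathbf{a}, \mathbf{b} \in \mathbb{R}^d$, let
$\mathbf{a}_{\cdot} = \sum_{j=1}^d a_j$, $\mathbf{a}^* = \prod_{j=1}^d a_j$,
$\mathbf{a}! = \prod_{j=1}^d a_j!$, $\bar{a} = \max_j a_j$,
$\underline{a} = \min_j a_j$,
$\mathbf{a}./\mathbf{b} = (a_1/b_1, \ldots, a_d/b_d)^T$,
$\mathbf{a} \cdot \mathbf{b} = (a_1 b_1, \ldots, a_d b_d)^T$,
$\mathbf{a}^{\mathbf{b}} = \prod_{j=1}^d a_j^{b_j}$. Denote
$\mathbf{a} \le \mathbf{b}$ if $a_j \le b_j$ for all $j = 1, \ldots, d$. For
$n = (n_1, \ldots, n_d)$, let $D^n f$ denote the mixed partial derivatives of
order $(n_1, \ldots, n_d)$ of $f$.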

Let $C[0, 1]^d$ and $C^\beta[0, 1]^d$ denote the space of all continuous
functions and the Hölder space of $\beta$-smooth functions
$f : [0, 1]^d \to \mathbb{R}$ respectively, endowed with the supremum norm
$\|f\|_\infty = \sup_{t \in [0,1]^d} |f(t)|$. For $\beta > 0$, the Hölder
space $C^\beta[0, 1]^d$ consists of functions $f \in C[0, 1]^d$ that have
bounded mixed partial derivatives up to order $\lfloor \beta \rfloor$, with the
partial derivatives of order $\lfloor \beta \rfloor$ being Lipschitz continuous
of order $\beta - \lfloor \beta \rfloor$.

Next, we define an anisotropic Hölder class of functions previously used in
Barron, Birgé and Massart (1999) and Klutchnikoff (2005). For a function
$f \in C[0, 1]^d$, $x \in [0, 1]^d$, and $1 \le i \le d$, let $f_i(\cdot \mid x)$
denote the univariate function
$y \mapsto f(x_1, \ldots, x_{i-1}, y, x_{i+1}, \ldots, x_d)$. For a vector of
positive numbers $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_d)$, the
anisotropic Hölder space $C^{\boldsymbol{\alpha}}[0, 1]^d$ consists of
functions $f$ which satisfy, for some $L > 0$,

(2.1)    $\max_{1 \le i \le d} \sup_{x \in [0,1]^d} \sum_{j=0}^{\lfloor \alpha_i \rfloor} \|D^j f_i(\cdot \mid x)\|_\infty \le L,$

and, for any $y \in [0, 1]$, $h$ small such that $y + h \in [0, 1]$ and for all
$1 \le i \le d$,

(2.2)    $\sup_{x \in [0,1]^d} |D^{\lfloor \alpha_i \rfloor} f_i(y + h \mid x) - D^{\lfloor \alpha_i \rfloor} f_i(y \mid x)| \le L |h|^{\alpha_i - \lfloor \alpha_i \rfloor}.$

For $t \in \mathbb{R}^d$ and a subset $I \subset \{1, \ldots, d\}$ of size
$|I| = \tilde{d}$ with $1 \le \tilde{d} \le d$, let $t_I$ denote the vector of
size $\tilde{d}$ consisting of the coordinates $(t_j : j \in I)$. Let
$C[0, 1]^I$ denote the subset of $C[0, 1]^d$ consisting of functions $f$ such
that $f(t) = g(t_I)$ for some function $g \in C[0, 1]^{\tilde{d}}$. Also, let
$C^{\boldsymbol{\alpha}}[0, 1]^I$ denote the subset of
$C^{\boldsymbol{\alpha}}[0, 1]^d$ consisting of functions $f$ such that
$f(t) = g(t_I)$ for some function $g \in C^{\boldsymbol{\alpha}_I}[0, 1]^{\tilde{d}}$.

The $\epsilon$-covering number $N(\epsilon, S, d)$ of a semi-metric space $S$
relative to the semi-metric $d$ is the minimal number of balls of radius
$\epsilon$ needed to cover $S$. The logarithm of the covering number is
referred to as the entropy.

We write $\lesssim$ for inequality up to a constant multiple. Let
$\phi(x) = (2\pi)^{-1/2} \exp(-x^2/2)$ denote the standard normal density, and
let $\phi_\sigma(x) = (1/\sigma)\phi(x/\sigma)$. Let an asterisk denote a
convolution, e.g., $(\phi_\sigma * f)(y) = \int \phi_\sigma(y - x) f(x)\, dx$.
Let $\hat{f}$ denote the Fourier transform of a function $f$ whenever it is
defined. Denote by $S_{d-1}$ the $(d-1)$-dimensional simplex consisting of
points $\{x \in \mathbb{R}^d : x_i \ge 0, 1 \le i \le d, \sum_{i=1}^d x_i = 1\}$.

2.1. RKHS of Gaussian processes. We briefly recall the definition of the 
RKHS of a Gaussian process prior next; a detailed review of the facts relevant 
to the present application can be found in van der Vaart and van Zanten 
(2008b). A Borel measurable random element $W$ with values in a separable
Banach space $(\mathbb{B}, \|\cdot\|)$ (e.g., $C[0, 1]$) is called Gaussian if
the random variable $b^* W$ is normally distributed for any element
$b^* \in \mathbb{B}^*$, the dual space of $\mathbb{B}$. The reproducing kernel
Hilbert space (RKHS) $\mathbb{H}$ attached to a zero-mean Gaussian process $W$
is defined as the completion of the linear space of functions
$t \mapsto E\,W(t)H$ relative to the inner product

$\langle E\,W(\cdot)H_1 ; E\,W(\cdot)H_2 \rangle_{\mathbb{H}} = E\,H_1 H_2,$

where $H$, $H_1$ and $H_2$ are finite linear combinations of the form
$\sum_i a_i W(s_i)$ with $a_i \in \mathbb{R}$ and $s_i$ in the index set of $W$.
The RKHS of a Gaussian process plays an important role in determining the
support and concentration properties of the process.

3. Main results. Let $W = \{W_t : t \in [0, 1]^d\}$ be a centered homogeneous
Gaussian process with covariance function $E(W_s W_t) = c(s - t)$. By
Bochner's theorem, there exists a finite positive measure $\nu$ on
$\mathbb{R}^d$, called the spectral measure of $W$, such that

$c(t) = \int e^{-i(\lambda, t)}\, \nu(d\lambda),$

where for $u, v \in \mathbb{C}^d$, $(u, v)$ denotes the complex inner product.
As in van der Vaart and van Zanten (2009), we shall restrict ourselves to
processes with spectral measure $\nu$ having sub-exponential tails, i.e., for
some $\delta > 0$,

(3.1)    $\int e^{\delta \|\lambda\|}\, \nu(d\lambda) < \infty.$

The spectral measure $\nu$ of a squared exponential covariance kernel with
$c(t) = \exp(-\|t\|^2)$ has a density w.r.t. the Lebesgue measure given by
$f(\lambda) = 1/(2^d \pi^{d/2}) \exp(-\|\lambda\|^2/4)$, which clearly satisfies (3.1).
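
The spectral representation of the squared exponential kernel can be checked numerically. The sketch below, restricted to dimension $d = 1$, integrates $\cos(\lambda t)$ against the density $f(\lambda) = \exp(-\lambda^2/4)/(2\sqrt{\pi})$ and compares the result with $c(t) = \exp(-t^2)$. The function names, the truncation of the integral to $[-40, 40]$ and the grid size are illustrative assumptions, not part of the paper.

```python
import numpy as np


def se_spectral_density(lam):
    """Spectral density of c(t) = exp(-t**2) in dimension d = 1."""
    return np.exp(-lam ** 2 / 4.0) / (2.0 * np.sqrt(np.pi))


def kernel_from_spectrum(t, half_width=40.0, num_points=200001):
    """Approximate c(t) = int exp(-i*lam*t) f(lam) dlam by the trapezoid rule.

    The density is even, so the imaginary part vanishes and only the
    cosine term needs to be integrated.
    """
    lam = np.linspace(-half_width, half_width, num_points)
    integrand = np.cos(lam * t) * se_spectral_density(lam)
    step = lam[1] - lam[0]
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1])) * step)


t = 0.7
approx = kernel_from_spectrum(t)   # numerical inverse Fourier transform
exact = float(np.exp(-t ** 2))     # closed-form kernel value
```

Because the density decays like a Gaussian, the exponential moment in (3.1) is finite for every $\delta > 0$, which is why the truncation to a bounded interval loses essentially nothing here.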

Rates of posterior contraction with Gaussian process priors were first 
studied by van der Vaart and van Zanten (2008a), who gave sufficient con- 
ditions in terms of the concentration function of a Gaussian random element 
for optimal rate of convergence in a variety of statistical problems including 
density estimation using the logistic Gaussian process (Lenk, 1988, 1991), 
Gaussian process mean regression, latent Gaussian process regression (e.g., 
in logit, probit models), binary classification, etc. As indicated in the intro- 
duction, one needs to build appropriate sieves in the space of continuous 
functions to get a handle on the posterior rates of convergence in such
models. van der Vaart and van Zanten (2008a) constructed the sieves as a
collection of continuous functions within a small (sup-norm) neighborhood of a
norm-bounded subset of the RKHS. Sharp bounds on the complement prob- 
ability of such sets can be obtained using Borell's inequality (Borell, 1975), 
and the metric entropy can also be appropriately controlled exploiting the 
fact that the RKHS consists of smooth functions if the covariance kernel is 
smooth. It is important to mention here that a similar strategy involving a 
subset of continuous functions bounded in sup-norm doesn't work beyond 
the uni-dimensional case (Tokdar and Ghosh, 2007). 

A process $W$ with infinitely smooth sample paths is not suitable for modeling
less smooth functions. Rescaling the sample paths of an infinitely smooth
Gaussian process is a powerful technique to improve the approximation of
$\alpha$-Hölder functions from the RKHS of the scaled process
$\{W^a_t = W_{at} : t \in [0, 1]^d\}$ with $a > 0$. Intuitively, for large
values of $a$, the scaled process traverses the sample path of an unscaled
process on the larger interval $[0, a]^d$, thereby incorporating more
"roughness". In the context of univariate function estimation,
van der Vaart and van Zanten (2007) had previously shown that a rescaled
Gaussian process $W^{a_n}$ with a deterministic scaling
$a_n = n^{1/(2\alpha+1)} \log^{\kappa} n$ leads to the minimax optimal rate for
$\alpha$-smooth functions up to a log factor. This specification requires
knowledge of the true smoothness to obtain the minimax rate. Since the true
smoothness is essentially always unknown, one would ideally employ a random
rescaling, i.e., place a prior on the scale. van der Vaart and van Zanten
(2009) studied rescaled Gaussian processes
$W^A = \{W_{At} : t \in [0, 1]^d\}$ for a real positive random variable $A$
stochastically independent of $W$, extending the framework of
van der Vaart and van Zanten (2008a) to the setting of conditionally Gaussian
random elements (see also De Jonge and van Zanten (2010) for a different class
of conditionally Gaussian processes). van der Vaart and van Zanten (2009)
showed that with a gamma prior on $A^d$, one obtains the minimax-optimal rate
of convergence $n^{-\alpha/(2\alpha+d)}$ (up to a logarithmic factor) for
$\alpha$-smooth functions. Since their prior specification does not involve the
unknown smoothness $\alpha$, the procedure is fully adaptive.

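The key result of van der Vaart and van Zanten (2009) was to construct
the sieves $B_n \subset C[0, 1]^d$ so that given $\alpha > 0$, a function
$w_0 \in C^\alpha[0, 1]^d$, and a constant $C > 1$, there exists a constant
$D > 0$ such that, for every sufficiently large $n$,

(3.2)    $\log N(\bar{\epsilon}_n, B_n, \|\cdot\|_\infty) \le D n \bar{\epsilon}_n^2,$

(3.3)    $P(W^A \notin B_n) \le e^{-C n \epsilon_n^2},$

(3.4)    $P(\|W^A - w_0\|_\infty \le \epsilon_n) \ge e^{-n \epsilon_n^2},$

with $\epsilon_n = n^{-\alpha/(2\alpha+d)} (\log n)^{\kappa_1}$,
$\bar{\epsilon}_n = n^{-\alpha/(2\alpha+d)} (\log n)^{\kappa_2}$ for constants
$\kappa_1, \kappa_2 > 0$.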

There is a deep connection between the above measure theoretic result
involving the concentration probability and complexity of the support of
the conditional Gaussian process $W^A$ and rates of posterior contraction
with Gaussian process priors. van der Vaart and van Zanten (2008a) mention
that the conditions (3.2)-(3.4) have a one-to-one correspondence with
the general sufficient conditions for rates of posterior contraction
(Theorem 2.1 of Ghosal, Ghosh and van der Vaart (2000)). In a specific
statistical setting involving Gaussian process priors on some function, sieves
in the parameter space of interest can be easily obtained by restricting the
unknown function to such sets $B_n$. It only remains to appropriately relate
the norm of discrepancy specific to the problem (e.g., Hellinger norm
for density estimation) to the Banach space norm (sup-norm in this case)
of the Gaussian random element to conclude that
$\max\{\epsilon_n, \bar{\epsilon}_n\}$ is the rate of posterior contraction;
refer to the discussion following Theorem 3.1 in
van der Vaart and van Zanten (2009).

In this article, we shall consider two function classes defined in Section 2:
(i) the Hölder class of functions $C^{\boldsymbol{\alpha}}[0, 1]^d$ with
anisotropic smoothness ($\boldsymbol{\alpha} \in \mathbb{R}_+^d$), and (ii) the
Hölder class of functions $C^\alpha[0, 1]^I$ with isotropic smoothness that can
possibly depend on fewer dimensions ($\alpha > 0$ and
$I \subset \{1, \ldots, d\}$). We shall study multi-bandwidth Gaussian
processes of the form
$\{W^{\mathbf{a}}_t = W_{\mathbf{a} \cdot t} : t \in [0, 1]^d\}$ for a vector of
rescalings (or inverse-bandwidths) $\mathbf{a} = (a_1, \ldots, a_d)^T$ with
$a_j > 0$ for all $j = 1, \ldots, d$. For a continuous function in the support
of a Gaussian process, the probability assigned to a sup-norm neighborhood of
the function is controlled by the centered small ball probability and how well
the function can be approximated from the RKHS of the process (Section 5 of
van der Vaart and van Zanten (2008b)). With the target class of functions as
in (i) or (ii), a single scaling seems inadequate and it is intuitively
appealing to introduce multiple bandwidth parameters to enlarge the RKHS and
facilitate improved approximation from the RKHS.
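
To make the effect of dimension-specific scalings concrete, the following sketch simulates a multi-bandwidth process on a small grid in $d = 2$. It assumes the squared exponential kernel $c(t) = \exp(-\|t\|^2)$, so that $W_{\mathbf{a} \cdot t}$ has covariance $\exp(-\sum_j a_j^2 (s_j - t_j)^2)$. The helper names and the eigendecomposition-based sampler are illustrative choices, not taken from the paper. The second call uses $a_2 = 0$, which lies outside the strict $a_j > 0$ setting but shows how a coordinate drops out of the process.

```python
import numpy as np


def multi_bandwidth_cov(points, a):
    """Covariance of W_{a.t} when W has kernel c(t) = exp(-||t||^2).

    points: array of shape (m, d); a: array of d nonnegative scalings.
    """
    scaled = np.asarray(points, dtype=float) * np.asarray(a, dtype=float)
    diff = scaled[:, None, :] - scaled[None, :, :]
    return np.exp(-np.sum(diff ** 2, axis=-1))


def sample_paths(points, a, num_draws=1, seed=0):
    """Draw sample paths on `points` via an eigendecomposition of the covariance."""
    rng = np.random.default_rng(seed)
    cov = multi_bandwidth_cov(points, a)
    eigval, eigvec = np.linalg.eigh(cov)
    # Clip tiny negative eigenvalues caused by rounding; cov ~ root @ root.T.
    root = eigvec * np.sqrt(np.clip(eigval, 0.0, None))
    z = rng.standard_normal((cov.shape[0], num_draws))
    return (root @ z).T


# Regular 8 x 8 grid on [0, 1]^2, ordered with the first coordinate varying slowest.
g = np.linspace(0.0, 1.0, 8)
grid = np.array([(s, t) for s in g for t in g])
cov = multi_bandwidth_cov(grid, a=[3.0, 0.5])
paths = sample_paths(grid, a=[3.0, 0.0], num_draws=2)
```

With `a=[3.0, 0.0]` the simulated surfaces vary only along the first coordinate, while a large $a_1$ produces visibly rougher behavior in that direction.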

As in van der Vaart and van Zanten (2007), we shall first consider minimax
estimation with deterministic scalings $\mathbf{a}_n$.
van der Vaart and van Zanten (2008a) showed that the rate of posterior
contraction with a Gaussian process prior $W$ is determined by the behavior of
the concentration function $\phi_{w_0}(\epsilon)$ for $\epsilon$ close to zero,
where

(3.5)    $\phi_{w_0}(\epsilon) = \inf_{h \in \mathbb{H} : \|h - w_0\|_\infty \le \epsilon} \|h\|_{\mathbb{H}}^2 - \log P(\|W\|_\infty \le \epsilon),$

and $\mathbb{H}$ is the RKHS of $W$. (We tacitly assume that there is a given
statistical problem where the true parameter $f_0$ is a known function of
$w_0$.) Based on their result, with a multi-bandwidth Gaussian process prior
$W^{\mathbf{a}_n}$, the posterior distribution would asymptotically accumulate
all of its mass on an $O(\epsilon_n)$ ball around the true parameter, where
$\epsilon_n$ is the smallest possible solution to

(3.6)    $\phi^{\mathbf{a}_n}_{w_0}(\epsilon_n) \lesssim n \epsilon_n^2,$

with $\phi^{\mathbf{a}_n}_{w_0}(\epsilon_n)$ denoting the concentration function
of the scaled process $W^{\mathbf{a}_n}$. In the following Theorem 3.1, we
state choices of the bandwidth parameters specific to (i) and (ii) that lead
to minimax rates of convergence. The proof follows from the properties of
multi-bandwidth GPs developed in Lemmas 4.1-4.4 and hence is not provided
separately.

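Theorem 3.1. 1. Suppose $w_0 \in C^{\boldsymbol{\alpha}}[0, 1]^d$ for some
$\boldsymbol{\alpha} \in \mathbb{R}_+^d$ and let
$\alpha_0^{-1} = \sum_{i=1}^d \alpha_i^{-1}$. Let
$\mathbf{a}_n = (a_{1n}, \ldots, a_{dn})^T$, where

(3.7)    $a_{jn} = [n^{1/(2\alpha_0+1)}]^{\alpha_0/\alpha_j}.$

Then, with $\epsilon_n = n^{-\alpha_0/(2\alpha_0+1)} \log^{\kappa_1} n$ for some
constant $\kappa_1$, $\phi^{\mathbf{a}_n}_{w_0}(\epsilon_n) \lesssim n \epsilon_n^2$.

2. Suppose $w_0 \in C^\alpha[0, 1]^I$ for some $\alpha > 0$ and
$I \subset \{1, \ldots, d\}$ with $|I| = d^*$. Let
$\mathbf{a}_n = (a_{1n}, \ldots, a_{dn})^T$, where

(3.8)    $a_{jn} = n^{1/(2\alpha+d^*)}$ if $j \in I$, and $a_{jn} = 1$ if $j \notin I$.

Then, with $\epsilon_n = n^{-\alpha/(2\alpha+d^*)} \log^{\kappa_2} n$ for some
constant $\kappa_2$, $\phi^{\mathbf{a}_n}_{w_0}(\epsilon_n) \lesssim n \epsilon_n^2$.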

Theorem 3.1 coupled with van der Vaart and van Zanten (2008a) implies
that a multi-bandwidth Gaussian process $W^{\mathbf{a}_n}$ with $\mathbf{a}_n$
as in (3.7) and (3.8) leads to the minimax optimal rate of convergence in
cases (i) and (ii) respectively.
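
The deterministic scalings for the anisotropic case can be tabulated directly. The sketch below reads (3.7) as $a_{jn} = [n^{1/(2\alpha_0+1)}]^{\alpha_0/\alpha_j}$ with $\alpha_0^{-1} = \sum_i \alpha_i^{-1}$; that reading, the helper name `anisotropic_scalings` and the omission of all logarithmic factors are assumptions made for illustration.

```python
def anisotropic_scalings(n, alphas):
    """Scalings a_jn = (n**(1/(2*alpha_0 + 1)))**(alpha_0/alpha_j)."""
    alpha0 = 1.0 / sum(1.0 / a for a in alphas)
    base = n ** (1.0 / (2.0 * alpha0 + 1.0))
    return [base ** (alpha0 / a) for a in alphas]


n = 10 ** 6
# Isotropic check: alpha = 2, d = 2 should give n**(1/(2*alpha + d)) = 10.
iso = anisotropic_scalings(n, [2.0, 2.0])
# Anisotropic case: the rougher first direction receives the larger scaling.
aniso = anisotropic_scalings(n, [1.0, 4.0])
# The product of the scalings matches n * eps_n**2 when log factors are ignored.
alpha0 = 1.0 / (1.0 / 1.0 + 1.0 / 4.0)
n_eps_sq = n * (n ** (-alpha0 / (2.0 * alpha0 + 1.0))) ** 2
```

Under this reading the scalings reduce to the single isotropic scaling when all $\alpha_j$ coincide, and less smooth directions are stretched more, in line with the intuition behind rescaling.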

Theorem 3.1 requires knowledge of the true smoothness levels or the true
dimensionality for minimax estimation. This is clearly unappealing and one
would instead like to devise priors on $\mathbf{a}$ that lead to minimax rates
for all smoothness levels. We propose a novel class of joint priors on the
rescaling vector $\mathbf{a}$ that leads to adaptation over function classes
(i) and (ii) in Sections 3.1 and 3.2 respectively. Connections between the two
prior choices are discussed and a unified framework is prescribed for the
function class
$\{C^{\boldsymbol{\alpha}}[0, 1]^I : \boldsymbol{\alpha} \in \mathbb{R}_+^d, I \subset \{1, \ldots, d\}\}$
combining (i) and (ii).

The main technical challenge for adaptation is to find sets $B_n$ so that
(3.2)-(3.4) are satisfied with $w_0$ in the above function classes and
$\epsilon_n$ being the optimal rate of convergence for the same. With such
sets $B_n$ one can use standard results to establish adaptive minimax rates of
convergence in various statistical settings. Applications to some specific
statistical problems are described in Section 3.4.



3.1. Adaptive estimation of anisotropic functions. Let
$A = (A_1, \ldots, A_d)^T$ be a random vector in $\mathbb{R}^d$ with each $A_j$
a non-negative random variable stochastically independent of $W$. We can then
define a scaled process $W^A = \{W_{A \cdot t} : t \in [0, 1]^d\}$, to be
interpreted as a Borel measurable map in $C[0, 1]^d$ equipped with the sup-norm
$\|\cdot\|_\infty$. The basic idea here is to stretch or shrink the different
dimensions by different amounts so that the resulting process becomes suitable
for approximating functions having differing smoothness along the different
coordinate axes.

We shall define a joint distribution on $A$ induced through the following
hierarchical specification. Let $\Theta = (\Theta_1, \ldots, \Theta_d)$ denote
a random vector with a density supported on the simplex $S_{d-1}$. In the
subsequent analysis, we shall assume
$\Theta \sim \mathrm{Dir}(\beta_1, \ldots, \beta_d)$ for some
$\beta = (\beta_1, \ldots, \beta_d)$. Given $\Theta = \theta$, we let the
elements of $A$ be conditionally independent, with
$A_j^{1/\theta_j} \sim g$, where $g$ is a density on the positive real line
satisfying

$C_1 x^p \exp(-D_1 x \log^q x) \le g(x) \le C_2 x^p \exp(-D_2 x \log^q x),$

for positive constants $C_1, C_2, D_1, D_2$ and every sufficiently large
$x > 0$.

In particular, the conditions in the above display are satisfied with $q = 0$
if $A_j^{1/\theta_j}$ follows a gamma distribution. For notational simplicity,
we shall assume $g$ to be a gamma density from now on, noting that the main
results would all hold for the general form of $g$ above.
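
A draw from this hierarchical prior is easy to simulate. The sketch below reads the specification as $\Theta \sim \mathrm{Dir}(\beta)$ followed by $A_j^{1/\theta_j} \sim$ gamma independently over $j$, which agrees with the remark in Section 3.3 that $\theta_j = 1/d$ yields a gamma prior on $A^d$. The function name and the gamma shape and rate values are hypothetical choices for illustration.

```python
import numpy as np


def sample_anisotropic_scales(beta, shape=2.0, rate=1.0, num_draws=1, seed=0):
    """Draw (theta, A) with theta ~ Dirichlet(beta) and A_j**(1/theta_j) ~ gamma.

    Returns theta and A as arrays of shape (num_draws, d).
    """
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    theta = rng.dirichlet(beta, size=num_draws)          # rows lie on the simplex
    g = rng.gamma(shape, 1.0 / rate, size=theta.shape)   # g_j ~ gamma(shape, rate)
    scales = g ** theta                                  # A_j = g_j ** theta_j
    return theta, scales


theta, scales = sample_anisotropic_scales(beta=[1.0, 1.0, 1.0], num_draws=5)
```

Small values of $\theta_j$ pull $A_j$ toward a narrow range, while larger values of $\theta_j$ allow heavier tails, so the simplex weights govern how strongly each direction can be stretched.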

Let $\pi_A$ denote the induced joint prior on $A$, so that
$\pi_A(\mathbf{a}) = \int \prod_{j=1}^d g(a_j \mid \theta_j)\, d\pi(\theta)$.
We now state our main theorem for the anisotropic smoothness
class in (i), with a detailed proof provided in Section 5.

Theorem 3.2. Let $W$ be a centered homogeneous Gaussian random field
on $\mathbb{R}^d$ with spectral measure $\nu$ that satisfies (3.1) and let
$W^A$ denote the multi-bandwidth process with $A \sim \pi_A$ as above. Let
$\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_d)$ be a vector of positive
numbers and $\alpha_0 = (\sum_{i=1}^d \alpha_i^{-1})^{-1}$. Suppose $w_0$
belongs to the anisotropic Hölder space $C^{\boldsymbol{\alpha}}[0, 1]^d$. Then
for every constant $C > 1$, there exist Borel measurable subsets $B_n$ of
$C[0, 1]^d$ and a constant $D > 0$ such that, for every sufficiently large $n$,
the conditions (3.2)-(3.4) are satisfied by $W^A$ with
$\epsilon_n = n^{-\alpha_0/(2\alpha_0+1)} (\log n)^{\kappa_1}$,
$\bar{\epsilon}_n = n^{-\alpha_0/(2\alpha_0+1)} (\log n)^{\kappa_2}$ for
constants $\kappa_1, \kappa_2 > 0$.

3.2. Adaptive dimension reduction. We next consider the smoothness
class in (ii), namely $C^\alpha[0, 1]^I$ for $I \subset \{1, \ldots, d\}$ and
$\alpha > 0$. If the true function has isotropic smoothness on the dimensions
it depends on, it is intuitively clear that one does not need a separate
scaling for each of the dimensions. Indeed, had we known the true coordinates
$I \subset \{1, \ldots, d\}$, we could have scaled only the dimensions in $I$
by a positive random variable $A$, and a slight modification of the results in
van der Vaart and van Zanten (2009) would imply that a gamma prior on
$A^{|I|}$ would lead to adaptation.

Without knowledge of $I$, it is natural to consider mixture priors of the
form $A_j \sim pA + (1 - p)B$, where $A$ and $B$ are positive random variables
and $0 < p < 1$, so that a subset of the dimensions are scaled by $A$ and the
remaining by $B$. Assume a gamma prior on $A^d$ and let $B$ have any fixed
compactly supported density. We first construct a sample size dependent prior
$\pi_A$ for $A$ through the following deterministic specification for
$p = p_n$ assuming knowledge of $|I|$ and the true smoothness level $\alpha$.

$A_j \sim p_n A + (1 - p_n) B, \quad j = 1, \ldots, d,$

where $d^* = |I|$. The following theorem is a result on partial adaptive
estimation, where we can adapt to the positions in $I$ using $\pi_A$ assuming
only the knowledge of $|I|$ and $\alpha$.

Theorem 3.3. Let $W$ be a centered homogeneous Gaussian random field
on $\mathbb{R}^d$ with spectral measure $\nu$ that satisfies (3.1) and let
$W^A$ denote the multi-bandwidth process with $A \sim \pi_A$ as above. Suppose
$w_0 \in C^\alpha[0, 1]^I$ and let $I \subset \{1, \ldots, d\}$ with
$|I| = d^*$. Then for every constant $C > 1$, there exist Borel measurable
subsets $B_n$ of $C[0, 1]^d$ and a constant $D > 0$ such that, for every
sufficiently large $n$, the conditions (3.2)-(3.4) are satisfied by $W^A$ with
$\epsilon_n = n^{-\alpha/(2\alpha+d^*)} (\log n)^{\kappa_1}$,
$\bar{\epsilon}_n = n^{-\alpha/(2\alpha+d^*)} (\log n)^{\kappa_2}$ for
constants $\kappa_1, \kappa_2 > 0$.

As in the previous sub-section, our ultimate aim is to propose a joint
prior on $A$ so that the rescaled process $W^A$ satisfies conditions
(3.2)-(3.4) without the knowledge of $\alpha$ or $I$. We describe such a prior
specification below.

Consider a joint prior $\pi_A$ on $A$ induced through the following
hierarchical scheme: (i) draw $\tilde{d}$ according to some prior distribution
(with full support) on $\{1, \ldots, d\}$, (ii) given $\tilde{d}$, draw a subset
$S$ of size $\tilde{d}$ from $\{1, \ldots, d\}$ following some prior
distribution assigning positive prior probability to all $\binom{d}{\tilde{d}}$
subsets of size $\tilde{d}$, (iii) generate a pair of random variables
$(A, B)$ with $A^{\tilde{d}} \sim$ gamma and $B$ drawn from a fixed compactly
supported density, and finally, (iv) let $A_j = A$ for $j \in S$ and
$A_j = B$ for $j \notin S$.
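
The four steps of this scheme translate directly into a sampler. The sketch below is one possible instance: it assumes uniform priors on the size $\tilde{d}$ and on subsets of that size, reads step (iii) as $A^{\tilde{d}} \sim$ gamma, and takes $B$ uniform on a bounded interval. The function name, the gamma parameters and the interval $[0.5, 1.5]$ are hypothetical choices; coordinates are indexed from 0 in the code.

```python
import numpy as np


def sample_dimension_reduction_scales(d, shape=2.0, rate=1.0,
                                      b_low=0.5, b_high=1.5, seed=0):
    """Draw (d_tilde, S, scales) from the four-step hierarchical scheme."""
    rng = np.random.default_rng(seed)
    # (i) size of the active set, uniform on {1, ..., d}
    d_tilde = int(rng.integers(1, d + 1))
    # (ii) uniformly chosen subset of that size (0-based coordinate indices)
    subset = np.sort(rng.choice(d, size=d_tilde, replace=False))
    # (iii) A with A**d_tilde ~ gamma, and B with compact support
    a_val = rng.gamma(shape, 1.0 / rate) ** (1.0 / d_tilde)
    b_val = rng.uniform(b_low, b_high)
    # (iv) coordinates in S are scaled by A, the remaining ones by B
    scales = np.full(d, b_val)
    scales[subset] = a_val
    return d_tilde, subset, scales


d_tilde, subset, scales = sample_dimension_reduction_scales(d=5)
```

Because the gamma variable is raised to the power $1/\tilde{d}$, larger active sets produce lighter tails for $A$, which is the coupling between subset size and tail heaviness that the construction relies on.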

We next state our main result on adaptive dimension reduction. The proof 
of the following Theorem 3.4 has elements in common with the proof of the 
previous theorem, and hence only a sketch of the proof is provided in Section 
5. Theorem 3.3 can be proved along similar lines. 

Theorem 3.4. Let $W$ be a centered homogeneous Gaussian random field
on $\mathbb{R}^d$ with spectral measure $\nu$ that satisfies (3.1) and let
$W^A$ denote the multi-bandwidth process with $A \sim \pi_A$ as above. Suppose
$w_0$ belongs to the Hölder space $C^\alpha[0, 1]^I$ for some subset $I$ of
$\{1, \ldots, d\}$ and $\alpha > 0$. Then for every constant $C > 1$, there
exist Borel measurable subsets $B_n$ of $C[0, 1]^d$ and a constant $D > 0$ such
that, for every sufficiently large $n$, the conditions (3.2)-(3.4) are
satisfied by $W^A$ with
$\epsilon_n = n^{-\alpha/(2\alpha+d_0)} (\log n)^{\kappa_1}$,
$\bar{\epsilon}_n = n^{-\alpha/(2\alpha+d_0)} (\log n)^{\kappa_2}$ for
constants $\kappa_1, \kappa_2 > 0$ and $d_0 = |I|$.

Remark 3.5. A salient feature of our hierarchical prior formulation is
that the tail heaviness of $A$ is related to the size of the subset $S$, i.e.,
the number of dimensions that are scaled by the non-compact random variable
$A$. For larger subsets $S$, the tails of $A$ get lighter, inducing a bigger
penalty for large values of $A$. In the previous mixture specification
$A_j \sim p_n A + (1 - p_n) B$, we believe that we needed the information of
$\alpha$ and $d_0$ in the weights $p_n$ since the interplay between the size of
$S$ and the tail heaviness of $A$ was missing.

3.3. Connections between cases (i) and (ii). The joint distributions on
$A$ specified in Sections 3.1 and 3.2 are closely connected. To begin with,
note that if we set $A_j = A$ and $\theta_j = 1/d$ for all $j$, one obtains a
gamma prior on $A^d$, which was previously suggested by
van der Vaart and van Zanten (2009). In the general anisotropic case, the joint
distribution can be motivated as follows. Recall that the purpose of rescaling
is to traverse the sample paths of an infinitely smooth stochastic process on
a larger domain to make it more suitable for less smooth functions. If the
true function has anisotropic smoothness, then we would like to stretch those
directions more where the function is less smooth. Now note that for smaller
values of $\theta_j$, the marginal distribution of $A_j$ has lighter tails
compared to larger values of $\theta_j$. We would thus like $\theta_j$ to
assume smaller values for the directions $j$ where the function is more smooth
and larger values corresponding to the less smooth directions. Without further
constraints on $\theta$, it is not possible to separate the scale of $A$ from
$\theta$. This motivates us to constrain $\theta$ to the simplex, which serves
as a weak identifiability condition.

In the limit as 9j — )• 0, the distribution of Oj converges to a point mass at 
zero. Accordingly, if the true function doesn't depend on a set of {d — d*) 
dimensions, we would set Oj = for those dimensions and choose the re- 
maining Oj^s from a, d* — 1 dimensional simplex. In particular, if the function 
has isotropic smoothness in the remaining d* coordinates, one can simply 
choose 9j = 1/d* for those dimensions. This explains our choice of letting 
a'^ follow a gamma distribution in Section 3.2. 

Based on the above discussion, we combine the results in Section 3.1 
and 3.2 to prescribe a unified framework for adaptively estimating functions 
which possibly depend on fewer coordinates and have anisotropic smoothness 
in the remaining ones, i.e., functions in C'*[0, 1]^ for o: S M^J_ and / C 
{l,...,d}. 

3.4. Rates of convergence in specific settings. The above two theorems 
are in the same spirit as Theorem 3.1 of van der Vaart and van Zanten 
(2009) and Theorem 2.2 of De Jonge and van Zanten (2010) and can be 
used to derive rates of posterior contraction in a variety of statistical prob- 
lems involving Gaussian random fields. We shall consider a couple of specific 
problems with the message that similar results can be obtained for a large
class of problems involving rescaled Gaussian random fields.

We first consider a regression problem where, given independent response
variables $y_i$ and covariates $x_i \in [0,1]^d$, the response is modeled as a random
perturbation around a smooth regression surface, i.e., $y_i = \mu(x_i) + \epsilon_i$. We
assume $\epsilon_i \sim N(0,\sigma^2)$ with a prior on $\sigma$ supported on some interval $[a,b] \subset
[0,\infty)$.

As motivated before, the regression surface might depend only on a sub-
set of variables in $[0,1]^d$ and have anisotropic smoothness in the remain-
ing variables. It is thus appealing to place a Gaussian process prior with
dimension specific rescalings on $\mu$, as follows. Let $W$ denote a Gaussian
process with squared exponential covariance kernel $c(t) = \exp(-\|t\|^2)$ and
$A = (A_1,\ldots,A_d)^T$ be a vector of positive random variables stochastically in-
dependent of $W$. We use the conditionally Gaussian process $\{W^A_t :
t \in [0,1]^d\}$ as a prior for $\mu$, with a joint prior on $A$ induced through the
following hierarchical specification: (i) draw $\tilde d$ uniformly on $\{1,\ldots,d\}$, (ii)
given $\tilde d$, draw a subset $S = \{i_1,\ldots,i_{\tilde d}\}$ of size $\tilde d$ uniformly from $\{1,\ldots,d\}$,
(iii) draw $\theta = (\theta_1,\ldots,\theta_{\tilde d})$ from the $(\tilde d - 1)$-dimensional simplex, (iv) let
$A_j^{1/\theta_j} \sim$ gamma for $j \in S$, and set the remaining $A_j$'s to zero.
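The hierarchical specification (i)-(iv) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not fix the distribution of $\theta$ on the simplex or the gamma parameters, so the sketch assumes a uniform (Dirichlet(1,...,1)) draw and a gamma with shape 1 and rate 1; the function names `sample_scales` and `se_kernel` are hypothetical.

```python
import math
import random

def sample_scales(d, shape=1.0, rate=1.0, rng=random):
    """Draw one rescaling vector A following steps (i)-(iv)."""
    d_tilde = rng.randint(1, d)                     # (i) number of active dimensions
    subset = sorted(rng.sample(range(d), d_tilde))  # (ii) uniform subset S (0-based)
    # (iii) theta on the simplex; uniform Dirichlet(1,...,1) is an assumption
    g = [rng.gammavariate(1.0, 1.0) for _ in subset]
    total = sum(g)
    theta = [x / total for x in g]
    # (iv) A_j^(1/theta_j) ~ gamma, i.e. A_j = G_j ** theta_j; the rest are zero
    scales = [0.0] * d
    for j, th in zip(subset, theta):
        scales[j] = rng.gammavariate(shape, 1.0 / rate) ** th
    return subset, theta, scales

def se_kernel(s, t, scales):
    """Covariance of W_{A.s} and W_{A.t} when c(t) = exp(-||t||^2)."""
    return math.exp(-sum((a * (x - y)) ** 2 for a, x, y in zip(scales, s, t)))

subset, theta, scales = sample_scales(5, rng=random.Random(0))
print(subset, theta, scales)
```

A coordinate with $A_j = 0$ contributes nothing to `se_kernel`, so the rescaled process is constant along that direction; this is how the prior discards unimportant predictors.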

We denote the posterior distribution by $\Pi(\cdot \mid y_1,\ldots,y_n)$. Let $\|\mu\|_n^2 =
n^{-1}\sum_{i=1}^n \mu^2(x_i)$ denote the $L_2$ norm corresponding to the empirical distri-
bution of the design points. Let the true value $\sigma_0$ of $\sigma$ be contained in the
interval $[a,b]$. The posterior is said to contract at a rate $\epsilon_n$ if, for every
sufficiently large $M$,

$$E_{\mu_0,\sigma_0}\,\Pi_n\big[(\mu,\sigma) : \|\mu - \mu_0\|_n + |\sigma - \sigma_0| > M\epsilon_n \mid y_1,\ldots,y_n\big] \to 0.$$

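Theorem 3.6. Let $\alpha = (\alpha_1,\ldots,\alpha_d)$ be a vector of positive numbers and
$I$ be a subset of $\{1,\ldots,d\}$. If $w_0 \in C^{\alpha}[0,1]^I$, then the posterior contracts at
the rate $\epsilon_n = n^{-\alpha_{0I}/(2\alpha_{0I}+1)}\log^{\kappa} n$, where $\alpha_{0I}^{-1} = \sum_{j\in I}\alpha_j^{-1}$.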

Thus, one obtains the minimax optimal rate up to a log factor adapting 
to the unknown dimensionality and anisotropic smoothness. 
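To make the rate in Theorem 3.6 concrete, the following minimal sketch computes the aggregate smoothness $\alpha_{0I}$ and the polynomial exponent $\alpha_{0I}/(2\alpha_{0I}+1)$ for a given smoothness vector and active set. The function names and the 0-based indexing of $I$ are illustrative choices, not part of the paper, and the logarithmic factor is ignored.

```python
def effective_smoothness(alpha, active):
    """alpha_0I, defined through 1/alpha_0I = sum over j in I of 1/alpha_j."""
    return 1.0 / sum(1.0 / alpha[j] for j in active)

def rate_exponent(alpha, active):
    """Exponent e such that the contraction rate is n**(-e) up to a log factor."""
    a0 = effective_smoothness(alpha, active)
    return a0 / (2.0 * a0 + 1.0)

alpha = [2.0, 2.0, 1.0]   # hypothetical smoothness levels for d = 3
print(rate_exponent(alpha, [0, 1]))      # function depends on the first two coordinates
print(rate_exponent(alpha, [0, 1, 2]))   # function depends on all three coordinates
```

In the isotropic case $\alpha_j = \alpha$ for all $j \in I$ with $|I| = d_0$, one has $\alpha_{0I} = \alpha/d_0$ and the exponent reduces to $\alpha/(2\alpha + d_0)$, matching Theorem 3.4.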

A similar result holds for density estimation using the logistic Gaussian
process. Suppose $X_1,\ldots,X_n$ are drawn i.i.d. from a continuous, everywhere
positive density $f_0$ on the hypercube $[0,1]^d$. Suppose one uses a multi-
bandwidth Gaussian process exponentiated and re-normalized to integrate
to one as the prior on the unknown density $f$, so that

$$f(x) = \frac{e^{W^A_x}}{\int_{[0,1]^d} e^{W^A_t}\,dt}.$$

Theorem 3.7. Let $\alpha = (\alpha_1,\ldots,\alpha_d)$ be a vector of positive numbers and
$I$ be a subset of $\{1,\ldots,d\}$. If $w_0 = \log f_0 \in C^{\alpha}[0,1]^I$, then the posterior
contracts at the rate $\epsilon_n = n^{-\alpha_{0I}/(2\alpha_{0I}+1)}\log^{\kappa} n$ with respect to the Hellinger
distance, where $\alpha_{0I}^{-1} = \sum_{j\in I}\alpha_j^{-1}$.



The proofs of Theorems 3.6 and 3.7 follow in a straightfor-
ward manner from our main results in Theorems 3.2 and 3.4. We do not pro-
vide a proof here since the steps are very similar to those in Section 3 of
van der Vaart and van Zanten (2008a).

4. Properties of the multi-bandwidth Gaussian process. We now
summarize some properties of the RKHS of the scaled process $W^a$ for a fixed
vector of scales $a$, which shall be crucially used to prove our main theorems.
The first five lemmas generalize the results in Section 4 of van der Vaart and
van Zanten (2009) from a single scaling to a vector of scales. A key idea in
van der Vaart and van Zanten (2009) to construct the sieves $B_n$ was to exploit
a containment relation among the unit balls of the RKHS with different
amounts of scaling. Such a result sufficed in the single rescaling framework
exploiting the ordering in elements of $\mathbb{R}_+$. However, the result can only be
generalized with respect to the partial order on $\mathbb{R}^d_+$, which is not sufficient
for our purpose. We develop a technique to circumvent this curse of
dimensionality by precisely calculating the metric entropy of a collection of
unit RKHS balls.

Assume that the spectral measure $\nu$ of $W$ has a spectral density $f$. For
$a \in \mathbb{R}^d_+$, the rescaled process $W^a$ has a spectral measure $\nu_a$ given by $\nu_a(B) =
\nu(B./a)$. Further, $\nu_a$ admits a spectral density $f_a$, with $f_a(\lambda) = (a^*)^{-1} f(\lambda./a)$,
where $a^* = \prod_{j=1}^d a_j$. For $w_0 \in C[0,1]^d$, define $\phi^a_{w_0}(\epsilon)$ to be the concentration
function of the rescaled Gaussian process $W^a$.

As a straightforward extension of Lemmas 4.1 and 4.2 in van der Vaart and
van Zanten (2009), it turns out that the RKHS of the process $W^a$ can be
characterized as below.

Lemma 4.1. The RKHS $\mathbb{H}^a$ of the process $\{W^a_t : t \in [0,1]^d\}$ consists of
real parts of the functions

$$t \mapsto \int e^{i(\lambda,t)} g(\lambda)\,\nu_a(d\lambda),$$

where $g$ runs over the complex Hilbert space $L_2(\nu_a)$. Further, the RKHS
norm of the element in the above display is given by $\|g\|_{L_2(\nu_a)}$.



Lemma 4.3 of van der Vaart and van Zanten (2009) shows that for any
isotropic Hölder smooth function $w$, convolutions with an appropriately
chosen class of higher order kernels indexed by the scaling parameter $a$ belong
to the RKHS. This suggests that, by driving the bandwidth $1/a$ to zero, one
can obtain improved approximations to any Hölder smooth function. The
following Lemma 4.2 illustrates the usefulness of using separate bandwidths
for each dimension for approximating anisotropic Hölder functions from the
RKHS.



Lemma 4.2. Assume $\nu$ has a density with respect to the Lebesgue mea-
sure which is bounded away from zero on a neighborhood of the origin. Let
$\alpha \in \mathbb{R}^d_+$ be given. Then, for any subset $I$ of $\{1,\ldots,d\}$ and $w \in C^{\alpha}[0,1]^I$,
there exist constants $C$ and $D$ depending only on $\nu$ and $w$ such that, for $a$
large enough,

$$\inf\Big\{\|h\|^2_{\mathbb{H}^a} : \|h - w\|_\infty \le C\sum_{i\in I} a_i^{-\alpha_i}\Big\} \le D a^*.$$

Proof. We shall prove the result for $w \in C^{\alpha}[0,1]^d$ and sketch an argu-
ment for extending the proof to any $w \in C^{\alpha}[0,1]^I$.

Let $\psi_j$, $j = 1,\ldots,d$, be a set of higher order kernels as in the proof of
Lemma 4.3 of van der Vaart and van Zanten (2009), which satisfy $\int \psi_j(t_j)\,dt_j = 1$,
$\int t_j^k \psi_j(t_j)\,dt_j = 0$ for any positive integer $k$ and $\int |t_j|^{\alpha_j}|\psi_j(t_j)|\,dt_j \le 1$.
Define $\psi : \mathbb{R}^d \to \mathbb{C}$ by $\psi(t) = \psi_1(t_1)\cdots\psi_d(t_d)$, so that one has $\int_{\mathbb{R}^d}\psi(t)\,dt = 1$,
$\int_{\mathbb{R}^d} t^k\psi(t)\,dt = 0$ for any non-zero multi-index $k = (k_1,\ldots,k_d)$, and the func-
tions $|\hat\psi|/f$ and $|\hat\psi|^2/f$ are uniformly bounded, where $\hat\psi$ denotes the Fourier
transform of $\psi$.

For a vector of positive numbers $a = (a_1,\ldots,a_d)$, let $\psi_a(t) = a^*\psi(a \cdot t)$,
where $a^* = \prod_{j=1}^d a_j$. By Whitney's theorem, $w$ can be extended to a function
$w : \mathbb{R}^d \to \mathbb{R}$ with compact support and $\|w\|_{\alpha} < \infty$. Working with this
extension, we shall first show that the convolution $\psi_a * w$ is contained in the
RKHS $\mathbb{H}^a$. To that end, note that,



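$$\psi_a * w(t) = \frac{1}{(2\pi)^d}\int e^{-i(t,\lambda)}\,\frac{\hat w(\lambda)\hat\psi_a(\lambda)}{f_a(\lambda)}\,\nu_a(d\lambda).$$

Thus, following Lemma 4.1, we need to show that $\hat w(\lambda)\hat\psi_a(\lambda)/f_a(\lambda) \in
L_2(\nu_a)$ to conclude that $\psi_a * w$ belongs to $\mathbb{H}^a$. Since $\hat\psi_a(\lambda) = \hat\psi(\lambda./a)$, one
has

$$\int\left|\frac{\hat w(\lambda)\hat\psi_a(\lambda)}{f_a(\lambda)}\right|^2\nu_a(d\lambda) \le a^*\left\|\frac{|\hat\psi|^2}{f}\right\|_\infty\int|\hat w(\lambda)|^2\,d\lambda.$$

The above assertion is thus proved by noting that $|\hat\psi|^2/f$ is uniformly bounded
by construction and $(2\pi)^d\int|\hat w(\lambda)|^2\,d\lambda = \int|w(t)|^2\,dt < \infty$. Also, the squared
RKHS norm of $\psi_a * w$ is bounded by $Da^*$, with $D$ depending only on $\nu$ and
$w$. Thus, the proof of Lemma 4.2 would be completed if we can show that
$\|\psi_a * w - w\|_\infty \le C\sum_{i=1}^d a_i^{-\alpha_i}$.

We have, for any $t \in \mathbb{R}^d$,

$$\psi_a * w(t) - w(t) = \int\psi(s)\{w(t - s./a) - w(t)\}\,ds.$$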



For $1 \le j \le d-1$, let $u^{(j)}$ denote the vector in $\mathbb{R}^d$ with $u^{(j)}_i = 0$ for
$i = 1,\ldots,j$ and $u^{(j)}_i = 1$ for $i = j+1,\ldots,d$. For any two vectors $x, y \in \mathbb{R}^d$,
we can navigate from $x$ to $y$ in a piecewise linear fashion traveling parallel to
one of the coordinate axes at a time. The vertices of the path will be given
by $x^{(0)} = x$, $x^{(j)} = u^{(j)}\cdot x + (1 - u^{(j)})\cdot y$ for $j = 1,\ldots,d-1$ and $x^{(d)} = y$.
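The coordinate-wise path can be checked numerically. The sketch below builds the vertices $x^{(0)},\ldots,x^{(d)}$ and the increments $w(x^{(j)}) - w(x^{(j-1)})$, whose sum telescopes to $w(y) - w(x)$; the helper names and the test function `w` are hypothetical and chosen only for illustration.

```python
def path_vertices(x, y):
    """Vertex x^(j) takes its first j coordinates from y and the rest from x."""
    d = len(x)
    return [list(y[:j]) + list(x[j:]) for j in range(d + 1)]

def telescoping_terms(w, x, y):
    """Increments w(x^(j)) - w(x^(j-1)); each changes a single coordinate."""
    v = path_vertices(x, y)
    return [w(v[j]) - w(v[j - 1]) for j in range(1, len(v))]

x = [0.1, 0.2, 0.3]
y = [0.5, 0.1, 0.9]
w = lambda p: p[0] ** 2 + 3 * p[1] * p[2]   # arbitrary smooth test function
print(path_vertices(x, y))
print(sum(telescoping_terms(w, x, y)), w(y) - w(x))
```

Because consecutive vertices differ in exactly one coordinate, each increment can be expanded by a one-dimensional Taylor series whose order matches the smoothness in that coordinate alone.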

A multivariate Taylor expansion of $w(t - s./a)$ around $w(t)$ cannot take
advantage of the anisotropic smoothness of $w$ across different coordinate
axes. Letting $x = t$, $y = t - s./a$ and $x^{(j)}$, $j = 0,1,\ldots,d$, as above, let us
write $w(y) - w(x)$ in the following telescoping form,

$$w(y) - w(x) = \sum_{j=1}^d\big[w(x^{(j)}) - w(x^{(j-1)})\big] = \sum_{j=1}^d\big[w_j(t_j - s_j/a_j \mid x^{(j)}) - w_j(t_j \mid x^{(j)})\big],$$

where the functions $w_j$ are as defined in Section 2, with
$w_j(t \mid x) = w(x_1,\ldots,x_{j-1},t,x_{j+1},\ldots,x_d)$ for any $t \in \mathbb{R}$ and $x \in \mathbb{R}^d$.
Thus,



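$$w(t - s./a) - w(t) = \sum_{j=1}^d\left[\sum_{i=1}^{\lfloor\alpha_j\rfloor} D^i w_j(t_j \mid x^{(j)})\,\frac{(-s_j/a_j)^i}{i!} + S_j(t_j, -s_j/a_j)\right],$$

where $|S_j(t_j, -s_j/a_j)| \le K|s_j|^{\alpha_j}a_j^{-\alpha_j}$ by (2.2), for a constant $K$ depending on
$\nu$ and $w$ but not on $t$ and $s$. The polynomial terms integrate to zero against
$\psi$ because the moments of each $\psi_j$ vanish and $w_j(\cdot \mid x^{(j)})$ does not involve
$s_j$ through its remaining coordinates. Combining the above, we have

$$\left|\int\psi(s)\{w(t - s./a) - w(t)\}\,ds\right| = \left|\sum_{j=1}^d\int\psi(s)\,S_j(t_j, -s_j/a_j)\,ds\right| \le C\sum_{j=1}^d a_j^{-\alpha_j}.$$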



If $w \in C^{\alpha}[0,1]^I$ for some subset $I$ of $\{1,\ldots,d\}$ with $|I| = \tilde d$, so that
$w(t) = w_0(t_I)$ for some $w_0 \in C^{\alpha_I}[0,1]^{\tilde d}$, then the conclusion follows trivially
from the observation $\psi_a * w = \psi_{a_I} * w_0$.

□

We next study the metric entropy of the unit ball of the RKHS and the
centered small ball probability of the rescaled process. Let $\mathbb{H}^a_1$ denote the
unit ball in the RKHS of $W^a$.

Lemma 4.3. There exists a constant $K$, depending only on $\nu$ and $d$, such
that, for $\epsilon < 1/2$,

$$\log N(\epsilon, \mathbb{H}^a_1, \|\cdot\|_\infty) \le K a^*\left(\log\frac{1}{\epsilon}\right)^{d+1}.$$

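Proof. By Lemma 4.1, an element of $\mathbb{H}^a_1$ can be written as the real part
of the function $h : [0,1]^d \to \mathbb{C}$ given by

(4.1) $h(t) = \int e^{i(\lambda,t)} g(\lambda)\,\nu_a(d\lambda)$

for $g : \mathbb{R}^d \to \mathbb{C}$ a function with $\int |g(\lambda)|^2\,\nu_a(d\lambda) \le 1$.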

Viewing $h$ as a function of $it$, we would like to exploit the sub-exponential
tails of $\nu$ as in (3.1) to extend $h$ analytically over a larger domain in $\mathbb{C}^d$. For
$z \in \mathbb{C}^d$, we shall continue to denote the function $z \mapsto \int e^{(\lambda,z)} g(\lambda)\,\nu_a(d\lambda)$ by
$h$. Using the Cauchy-Schwarz inequality and the change of variable theorem,

(4.2) $|h(z)|^2 \le \int e^{(\lambda,\,2a\cdot\mathrm{Re}(z))}\,\nu(d\lambda),$

where $\mathrm{Re}(z)$ denotes the vector whose $j$th element is the real part of $z_j$ for
$j = 1,\ldots,d$, and $a\cdot\mathrm{Re}(z) = (a_1\mathrm{Re}(z_1),\ldots,a_d\mathrm{Re}(z_d))^T$. From (4.2) and the
dominated convergence theorem, any $h \in \mathbb{H}^a_1$ can be analytically extended
to $\Gamma = \{z \in \mathbb{C}^d : \|2a\cdot\mathrm{Re}(z)\|_2 < \delta\}$. Clearly, $\Gamma$ contains a strip $\Omega$ in $\mathbb{C}^d$ given
by $\Omega = \{z \in \mathbb{C}^d : |\mathrm{Re}(z_j)| \le R_j, j = 1,\ldots,d\}$ with $R_j = \delta/(6a_j\sqrt{d})$. Also,
for every $z \in \Omega$, $h$ satisfies the uniform bound $|h(z)|^2 \le \int e^{\delta\|\lambda\|/3}\,\nu(d\lambda) = C^2$.
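As a quick sanity check on the choice of strip, the sketch below computes the half-widths $R_j = \delta/(6a_j\sqrt{d})$ and the largest value of $\|2a\cdot\mathrm{Re}(z)\|_2$ over the strip, which equals $\delta/3$ whatever the scales are. The values of `a` and `delta` are hypothetical, and the function names are illustrative only.

```python
import math

def strip_radii(a, delta):
    """Half-widths R_j = delta / (6 * a_j * sqrt(d)) of the strip Omega."""
    d = len(a)
    return [delta / (6.0 * a_j * math.sqrt(d)) for a_j in a]

def worst_case_norm(a, radii):
    """Largest ||2 a . Re(z)||_2 on the strip, attained at |Re(z_j)| = R_j."""
    return 2.0 * math.sqrt(sum((a_j * r_j) ** 2 for a_j, r_j in zip(a, radii)))

a = [1.0, 4.0, 0.5]   # hypothetical scale vector
delta = 0.9           # hypothetical constant from condition (3.1)
radii = strip_radii(a, delta)
print(radii, worst_case_norm(a, radii))
```

Larger scales $a_j$ give narrower strips in coordinate $j$, so the product $R_1\cdots R_d$ is proportional to $1/a^*$; this is what makes the entropy bound in Lemma 4.3 scale with $a^*$.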

The analytic extension of $h$ to a strip containing the product of the imag-
inary axes allows us to precisely estimate the error term of a $k$-order Taylor
expansion of $h(t)$. For $t \in [0,1]^d$, let $C_1,\ldots,C_d$ denote circles of radius
$R_1,\ldots,R_d$ in the complex plane around the coordinates $it_1,\ldots,it_d$ of $it$
respectively. Using the Cauchy integral formula,

$$\left|\frac{D^n h(t)}{n!}\right| = \left|\frac{1}{(2\pi i)^d}\oint_{C_1}\cdots\oint_{C_d}\frac{h(z)}{(z - t)^{n+1}}\,dz_1\cdots dz_d\right| \le \frac{C}{R^n},$$

where $D^n$ denotes the partial derivative of order $n = (n_1,\ldots,n_d)$. This sug-
gests using a net of piecewise polynomials for approximating the elements of
$\mathbb{H}^a_1$.

One can discretize the coefficients and centers of the piecewise polyno-
mials to obtain a finite set of functions that approximate the leading terms
of a Taylor expansion of a function in $\mathbb{H}^a_1$, and the remainder terms can be
controlled using the bound in the above display.

To elaborate, let $R = (R_1,\ldots,R_d)^T$. Partition $T = [0,1]^d$ into rectangles
$T_1,\ldots,T_m$ with centers $\{t_1,t_2,\ldots,t_m\}$ such that given any $z \in T$, there
exists $T_j$ with center $t_j = (t_{j1},\ldots,t_{jd})^T$ with $|z_i - t_{ji}| \le R_i/4$, $i = 1,\ldots,d$.
Consider the piecewise polynomials $P = \sum_{j=1}^m P_{j,\gamma_j}1_{T_j}$ with

$$P_{j,\gamma_j}(t) = \sum_{n. \le k}\gamma_{j,n}(t - t_j)^n.$$

We obtain a finite set of functions $\mathcal{P}_a$ by discretizing the coefficients $\gamma_{j,n}$ for
each $j$ and $n$ over a grid of mesh width $\epsilon/R^n$ in the interval $[-C/R^n, C/R^n]$,
with $R^n = R_1^{n_1}\cdots R_d^{n_d}$ and $C$ defined as above. As in van der Vaart and van Zanten
(2009), the log cardinality of the set is bounded above by

(4.3) $\log\Big(\prod_{j=1}^m\prod_{n:\,n.\le k}\#\gamma_{j,n}\Big) \le m k^d\log\Big(\frac{2C}{\epsilon}\Big).$

We can choose $m \asymp 1/R^*$. The proof is complete if we show that the resulting
set of functions is a $K\epsilon$-net for constants $C$ and $K$ depending on $\nu$, and
$k \asymp \log(1/\epsilon)$. The rest of the proof follows exactly as in the proof for Lemma
4.5 in van der Vaart and van Zanten (2009) by showing that



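(4.4) $\Big|\sum_{n. > k}\frac{D^n h(t_j)}{n!}(z - t_j)^n\Big| \le \sum_{n. > k}\frac{C}{R^n}\,|(z - t_j)^n| \le KC\Big(\frac{2}{3}\Big)^k$

and

(4.5) $\Big|\sum_{n. \le k}\frac{D^n h(t_j)}{n!}(z - t_j)^n - P_{j,\gamma_j}(z)\Big| \le K\epsilon.$

The proof is completed by choosing $k$ large enough such that $(2/3)^k \le
K\epsilon$. □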



Lemma 4.4. For any $a_0$ positive, there exist constants $C$ and $\epsilon_0 > 0$
such that for $a \ge a_0$ and $\epsilon < \epsilon_0$,

$$-\log P(\|W^a\|_\infty \le \epsilon) \le C a^*\Big(\log\frac{a^*}{\epsilon}\Big)^{d+1}.$$

Proof. This follows from Theorem 2 in Kuelbs and Li (1993) and Lemma
4.6 in van der Vaart and van Zanten (2009). Proceeding as in Lemma 4.6 in
van der Vaart and van Zanten (2009) and Lemma 4.3, we obtain

(4.6) <P^{e) + log 0.5 < i^ia* [log . 

for some constant Ki > 0. Note that with L = [0, oi] x • • • x [0, ad], 

(4.7) ^^{e) = -logP{\\W^^<e) = - log P(sup | Wt| < e) 

(4.8) < -logP( sup \Wt\<e) 

iG[0,a]'* 



(4.9) < K2 



for some constant K2 and r > 0, where the last inequality follows from the 
proof of Lemma 4.6 in van der Vaart and van Zanten (2009). Inserting this 
bound in (4.6), we obtain 



(4.10) $-\log P(\|W^a\|_\infty \le \epsilon) \le C a^*\Big(\log\frac{a^*}{\epsilon}\Big)^{d+1}$

for some constant $C > 0$. □

We next state a nesting property of the unit ball $\mathbb{H}^a_1$ of the RKHS of $W^a$
for different values of $a$, generalizing Lemma 4.7 of van der Vaart and van Zanten
(2009).

Lemma 4.5. Assume the spectral measure $\nu$ satisfies (3.1) and has a den-
sity $f$ with respect to the Lebesgue measure on $\mathbb{R}^d$ which satisfies $f(t./a) \le
f(t./b)$ for any $a \le b$. Then,

$$\sqrt{a_1\cdots a_d}\;\mathbb{H}^a_1 \subset \sqrt{b_1\cdots b_d}\;\mathbb{H}^b_1.$$

Proof. Let $h \in \mathbb{H}^a_1$. Following Lemma 4.1, $h(t) = \int e^{i(\lambda,t)}\psi(\lambda)\,\nu_a(d\lambda)$.
Since $\|h\|_{\mathbb{H}^a} \le 1$, it follows that $\int|\psi(\lambda)|^2 f_a(\lambda)\,d\lambda \le 1$. Now, $h(t) =
\int e^{i(\lambda,t)}\{\psi(\lambda)f_a(\lambda)/f_b(\lambda)\}\,\nu_b(d\lambda)$. The conclusion follows since

$$\|h\|^2_{\mathbb{H}^b} = \int|\psi(\lambda)|^2\Big\{\frac{f_a(\lambda)}{f_b(\lambda)}\Big\}^2\nu_b(d\lambda) \le \frac{b^*}{a^*},$$

using the fact that $f_a(\lambda)/f_b(\lambda) = (b^*/a^*)\,f(\lambda./a)/f(\lambda./b) \le b^*/a^*$ by
assumption. □

van der Vaart and van Zanten (2009) crucially used the above contain-
ment relation among the RKHS unit balls in the single bandwidth case to
conclude that $(r/\delta)^{d/2}\,\mathbb{H}^r_1$ contains $\mathbb{H}^a_1$ for all $a$ in the interval $[\delta, r]$. Combin-
ing this fact with the observation that for very small values of $a$, the sample
paths of $W^a$ behave like a constant function, they could construct the sieves
$B_n$ containing $M\mathbb{H}^a_1 + \epsilon B_1$ for all $a \in [0,r]$ without increasing the entropy
from that of $M\mathbb{H}^r_1 + \epsilon B_1$. The complement probability of $B_n$ under the law
of the rescaled process could also be appropriately controlled by choosing $r$
large enough so that $P(A > r)$ is small enough. However, one does not obtain
a straightforward generalization of the above scheme to the multi-bandwidth
case, since the entropy of the sieve blows up in trying to control the joint
probability of the rescaling vector $a$ outside a hyper-rectangle in $\mathbb{R}^d_+$.

The problem mentioned above is fundamentally due to the curse of di-
mensionality and one needs a more careful construction of the sieve to avoid
this problem. The next three lemmas are crucially used in our treatment of
the multi-bandwidth case. In the proof of Lemma 4.3, a collection of piece-
wise polynomials is used to cover the unit RKHS ball $\mathbb{H}^a_1$. The main idea in
the next set of lemmas is to exploit the fact that the same set of piecewise
polynomials can also be used to cover $\mathbb{H}^b_1$ for $b$ sufficiently close to $a$. Fur-
ther, we shall carefully choose a compact subset $Q$ of $\mathbb{R}^d_+$ that balances the
metric entropy of the collection of unit RKHS balls $\mathbb{H}^a_1$ with $a \in Q$ and the
complement probability of $Q$ under the joint prior on $a$.

Let $S^0_{d-1}$ denote the interior of $S_{d-1}$, i.e., all vectors $\theta \in \mathbb{R}^d$ with $\sum_{j=1}^d\theta_j =
1$ and $\theta_j > 0$ for all $j = 1,\ldots,d$. For $u \in \mathbb{R}^d_+$, let $C_u$ denote the rectangle in
the positive quadrant given by $a \le u$, i.e., $0 \le a_j \le u_j$ for all $j = 1,\ldots,d$.
For a fixed $r > 0$, let $Q = Q^{(r)}$ consist of vectors $a$ with $a_j \le r^{\theta_j}$ for some
$\theta \in S^0_{d-1}$. Clearly, $Q$ is a union of rectangles $C_{r^\theta}$ over $\theta \in S^0_{d-1}$. The
volume of each such rectangle $C_{r^\theta}$ is $r$ and the outer boundary of $Q$ consists
of points $a$ with $a_j \le r$ for all $j = 1,\ldots,d$ and $a^* = r$ (Figure 1). By Lemma
4.3, for any such $a$ in the outer boundary of $Q$, the metric entropy of $\mathbb{H}^a_1$ is
bounded by a constant multiple of $r\log^{d+1}(1/\epsilon)$. In the following, we show in
Lemma 4.6 that the metric entropy of the collection of unit RKHS balls with
$a$ varying over the outer boundary of $Q$ is still of the order of $r\log^{d+1}(1/\epsilon)$.
Lemmas 4.7 and 4.8 establish a stronger result which states that the entropy
remains of the same order even if the union is considered over all of $Q$.



Lemma 4.6. For a positive number $r > 1$ and $\theta \in S^0_{d-1}$, let $\mathbb{H}^{r^\theta}_1$ denote
the unit ball of the RKHS of $W^a$ with $a_j = r^{\theta_j}$ for $1 \le j \le d$. Then, there
exists a constant $K_1$, depending only on $\nu$ and $d$, such that, for $\epsilon < 1/2$,

$$\log N\Big(\epsilon, \bigcup_{\theta\in S^0_{d-1}}\mathbb{H}^{r^\theta}_1, \|\cdot\|_\infty\Big) \le K_1 r\Big(\log\frac{1}{\epsilon}\Big)^{d+1}.$$

Fig 1. Left panel: For fixed $r > 1$, rectangles $C_{r^\theta} = \{0 \le a \le r^\theta\}$ for different values of
$\theta \in S^0_{d-1}$. Right panel: the region $Q$ (shaded) resulting from the union of all such rectangles.

Proof. Let $\tilde Q = \{a \in \mathbb{R}^d_+ : 1 \le a_j \le r\ \forall j = 1,\ldots,d,\ a^* = r\}$ denote
the outer boundary of $Q$ defined above. Clearly,

$$\bigcup_{\theta\in S^0_{d-1}}\mathbb{H}^{r^\theta}_1 = \bigcup_{a\in\tilde Q}\mathbb{H}^a_1.$$

For $a, b \in \tilde Q$, the idea of the proof is to show that the piecewise polynomials
$\mathcal{P}_a$ that form a $K\epsilon$-net for $\mathbb{H}^a_1$ in the proof of Lemma 4.3 are also a $K\epsilon$-net
for $\mathbb{H}^b_1$ if $b$ is "close enough" to $a$.

Fix $a \in \tilde Q$. Let $\Omega^a = \{z \in \mathbb{C}^d : |\mathrm{Re}(z_j)| \le R_j, j = 1,\ldots,d\}$ with
$R_j = \delta/(6a_j\sqrt{d})$ denote the strip in $\mathbb{C}^d$ on which every $h \in \mathbb{H}^a_1$ can be
analytically extended. Let $b \in \tilde Q$ satisfy $\max_j|a_j - b_j| \le 1$. We shall show
that any $h \in \mathbb{H}^b_1$ can also be extended analytically to the same strip $\Omega^a$ by
showing that $\|2b\cdot\mathrm{Re}(z)\|_2 < \delta$ on $\Omega^a$. To that end, for $z \in \Omega^a$,

$$\|2b\cdot\mathrm{Re}(z)\|_2 \le \|2a\cdot\mathrm{Re}(z)\|_2 + \|2(b-a)\cdot\mathrm{Re}(z)\|_2 \le 2\|2a\cdot\mathrm{Re}(z)\|_2 \le 2\delta/3,$$

where the penultimate inequality uses $|b_j - a_j| \le 1 \le a_j$ for all $j = 1,\ldots,d$.

Clearly, the same tail estimate as in (4.4) works for any $h \in \mathbb{H}^b_1$. From
(4.5), it thus follows that the set of functions $\mathcal{P}_a$ forms a $K\epsilon$-net for $\mathbb{H}^b_1$. Let
$\mathcal{A}$ be a set of points in $\tilde Q$ such that for any $b \in \tilde Q$, there exists $a \in \mathcal{A}$ such
that $\max_j|a_j - b_j| \le 1$. One can clearly find an $\mathcal{A}$ with $|\mathcal{A}| \le r^d$. The proof
is completed by observing that $\bigcup_{a\in\mathcal{A}}\mathcal{P}_a$ forms a $K\epsilon$-net for $\bigcup_{\theta\in S^0_{d-1}}\mathbb{H}^{r^\theta}_1$. □

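Lemma 4.7. For $u \in \mathbb{R}^d_+$, let $C_u$ denote the subset of $\mathbb{R}^d_+$ consisting of
all vectors $a \le u$, i.e., $a_j \le u_j$ for all $j = 1,\ldots,d$. Then, there exists a
constant $K_2$, depending only on $\nu$ and $d$, such that, for $\epsilon < 1/2$,

$$\log N\Big(\epsilon, \bigcup_{a\in C_u}\mathbb{H}^a_1, \|\cdot\|_\infty\Big) \le K_2 u^*\Big(\log\frac{1}{\epsilon}\Big)^{d+1}.$$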

Proof. The idea of the proof is similar to Lemma 4.6 in that we par-
tition the space $C_u$ into finitely many sets and cover the collection of unit
RKHS balls with the scaling vector varying over one of these sets by a single
collection of piecewise polynomials. We only sketch the partitioning scheme
here and the rest of the proof is similar to Lemma 4.6.

For a subset $I$ of $\{1,\ldots,d\}$, let $C^I_u$ denote the subset of $C_u$ consisting of
vectors $a \le u$ with $a_j \le 1$ for all $j \in I$ and $a_j > 1$ for all $j \notin I$. Then,
clearly $C_u$ can be written as the following disjoint union,

$$C_u = \bigcup_{l=0}^d\ \bigcup_{I:\,|I| = l} C^I_u.$$

Fix $0 \le l \le d$ and a subset $I$ of $\{1,\ldots,d\}$ with $|I| = l$. It suffices to prove
the desired entropy bound for $C^I_u$. We shall slightly modify the complex strip
from the proof of Lemma 4.3 to exploit that for any $a \in C^I_u$, the values of $a_j$ for the
coordinates $j$ in $I$ are smaller than one.

Fix $a \in C^I_u$. Let $\Omega^a = \{z \in \mathbb{C}^d : |\mathrm{Re}(z_j)| \le R_j, j = 1,\ldots,d\}$ with $R_j =
\delta/(6a_j\sqrt{d})$ if $j \notin I$ and $R_j = \delta/(6\sqrt{d})$ if $j \in I$. Since $\|2a\cdot\mathrm{Re}(z)\| < \delta$ for any
$z \in \Omega^a$, it follows from the proof of Lemma 4.3 that any function $h \in \mathbb{H}^a_1$ has
an analytic extension to $\Omega^a$. Let $b \in C^I_u$ satisfy $\max_j|a_j - b_j| \le 0.5$. Then
one can prove along the lines of Lemma 4.6 that any $h \in \mathbb{H}^b_1$ can also be extended
analytically to $\Omega^a$. The remainder of the proof follows similarly as Lemma
4.6, where the net for $C^I_u$ is constructed as the union of the sets of piecewise
polynomials covering $\mathbb{H}^a_1$, with $a$ varying over a finite subset of $C^I_u$ with
cardinality $O(u^*)$. □

The following Lemma 4.8 follows along similar lines as the previous two
lemmas.

Lemma 4.8. Let $\nu$ satisfy (3.1). Fix $r \ge 1$. Then, there exists a constant
$K_2$ depending on $\nu$ and $d$ only, so that, for $\epsilon < 1/2$,

$$\log N\Big(\epsilon, \bigcup_{\theta\in S_{d-1}}\ \bigcup_{a\le r^\theta}\mathbb{H}^a_1, \|\cdot\|_\infty\Big) \le K_2 r\Big(\log\frac{1}{\epsilon}\Big)^{d+1}.$$

5. Proof of main results. We shall only provide a detailed proof of 
Theorem 3.2 and sketch the main steps in the proof of Theorem 3.4. 

5.1. Proof of Theorem 3.2. Let us begin by observing that,

$$P(\|W^A - w_0\|_\infty \le 2\epsilon) = \int P(\|W^a - w_0\|_\infty \le 2\epsilon)\,\pi_A(da)
= \int_\theta\Big\{\int_a P(\|W^a - w_0\|_\infty \le 2\epsilon)\,\pi(a \mid \theta)\,da\Big\}\pi(\theta)\,d\theta.$$

As in van der Vaart and van Zanten (2009), we first derive bounds on the
non-centered small ball probability for a fixed rescaling $a$, and then integrate
over the distribution of $a$ to derive the same for $W^A$.

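Given $a \in \mathbb{R}^d_+$, recall the definition of the centered and non-centered
concentration functions of the process $W^a$,

$$\phi^a_0(\epsilon) = -\log P(\|W^a\|_\infty \le \epsilon),$$
(5.1) $\phi^a_{w_0}(\epsilon) = \inf_{h\in\mathbb{H}^a:\,\|h - w_0\|_\infty\le\epsilon}\|h\|^2_{\mathbb{H}^a} - \log P(\|W^a\|_\infty \le \epsilon).$

For a fixed $a$, the non-centered small ball probability of $W^a$ can be bounded in
terms of the concentration function as follows (van der Vaart and van Zanten,
2008b),

$$P(\|W^a - w_0\|_\infty \le 2\epsilon) \ge e^{-\phi^a_{w_0}(\epsilon)}.$$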

Now, suppose that $w_0 \in C^{\alpha}[0,1]^d$ for some $\alpha \in \mathbb{R}^d_+$. From Lemmas 4.2 and
4.4, it follows that for every $a_0 > 0$, there exist positive constants $\epsilon_0 < 1/2$,
$C$, $D$ and $E$ that depend only on $w_0$ and $\nu$ such that, for $a \ge a_0$, $\epsilon < \epsilon_0$ and
$C\sum_{i=1}^d a_i^{-\alpha_i} < \epsilon$,

$$\phi^a_{w_0}(\epsilon) \le Da^* + Ea^*\Big(\log\frac{a^*}{\epsilon}\Big)^{1+d} \le K_1 a^*\Big(\log\frac{a^*}{\epsilon}\Big)^{1+d},$$

with Ki depending only on aQ,iy and d. Thus, for e < min{eo, CiOq by 
(5.1), for constants K2, . . . , Kq > and C2, . . . , Cg > 0, 



< 2e 

00 



>/i/ •••/ e-^i^*'°*5^+ (^/^V(a| 0)daU(0)d0 

Je [ Jai = {Ci/t)^/°'i Jad=(Ci/e)i/"<i J 

Let r denote the region in the simplex Sd-i given by F = {0 G ^^^i : r < 
0j — ^ < 2r, j = 1, . . . , d — 1}. Since X]j=i Q^o/oj = Ij we can choose r > 
small enough to guarantee that any 9 satisfying the set of inequalities lies 
inside the simplex. Moreover, with 9d = 1 — X]j=i ^31 has (d — l)r < 9d < 
2(d-l)r. Choosingr = C3/log(l/e), one can show that Ej=i( V^)^^^"'^'^ < 
C4(l/e)i/"o for any 6* G T. Now, 

••• / 7r(a I 9)da.\TT{9)d9 

ai={Ci/e)l/"i Jad=(Ci/e)V"d J 
2{Ci/e)l/°'l /■2(Ci/£)l/"d 



> . . . 

= {Cl/e)l/"l Jaa = {Ci/e)^/'^d 

> [ e-''^^U(y^)''''^'\(9)d9 



The last inequality in the above display uses that, 

(•ao/ai-T rao/a^-i-T 



I 7T{9)d9= I ' ' .../"■"' 9f'-\..9^/_-'~\l-Y.^j)^''~'d9i...d9d-i 
can be bounded below by a polynomial in $\tau \propto 1/\log(1/\epsilon)$. Hence,



(5.2) 



< 2e > Cfie 



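Let $B_1$ denote the unit sup-norm ball of $C[0,1]^d$. For a vector $\theta \in S_{d-1}$
and positive constants $M, r, \epsilon$, let $B^\theta = B^\theta(M,r,\epsilon)$ denote the set,

$$B^\theta = \bigcup_{a\le r^\theta}(M\mathbb{H}^a_1) + \epsilon B_1,$$

where $r^\theta$ denotes the vector whose $j$th element is $r^{\theta_j}$. We further let,

$$B = \bigcup_{\theta\in S_{d-1}}\ \bigcup_{a\le r^\theta}(M\mathbb{H}^a_1) + \epsilon B_1.$$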

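Let us first calculate the probability $P(W^A \notin B^\theta \mid \theta)$. Note that,

$$P(W^A \notin B^\theta \mid \theta) = \int P(W^a \notin B^\theta)\,\pi(a \mid \theta)\,da
\le \int_{a\le r^\theta} P(W^a \notin B^\theta)\,\pi(a \mid \theta)\,da + P(A \nleq r^\theta \mid \theta),$$

where $P(A \nleq r^\theta \mid \theta)$ is a shorthand notation for $P(\text{at least one } A_j > r^{\theta_j} \mid
\theta)$.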

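To tackle the first term in the last display, note that $B^\theta$ contains the set
$M\mathbb{H}^a_1 + \epsilon B_1$ for any $a \le r^\theta$ by definition. Hence, for any $a \le r^\theta$, by Borell's
inequality,

$$P(W^a \notin B^\theta) \le P(W^a \notin M\mathbb{H}^a_1 + \epsilon B_1)
\le 1 - \Phi\big(\Phi^{-1}(e^{-\phi^a_0(\epsilon)}) + M\big)
\le 1 - \Phi\big(\Phi^{-1}(e^{-\phi^{r^\theta}_0(\epsilon)}) + M\big)
\le e^{-\phi^{r^\theta}_0(\epsilon)}$$

if $M \ge -2\Phi^{-1}(e^{-\phi^{r^\theta}_0(\epsilon)})$, where the penultimate inequality follows from the
fact that, with $T = [0,1]^d$,

$$e^{-\phi^a_0(\epsilon)} = P\Big(\sup_{t\in a\cdot T}|W_t| \le \epsilon\Big) \ge P\Big(\sup_{t\in r^\theta\cdot T}|W_t| \le \epsilon\Big) = e^{-\phi^{r^\theta}_0(\epsilon)}.$$

By Lemma 4.10 of van der Vaart and van Zanten (2009), $\Phi^{-1}(u) \ge -\{2\log(1/u)\}^{1/2}$
for $u \in (0,1)$. Hence, the last inequality in the above display remains valid
if we choose

$$M \ge 4\sqrt{\phi^{r^\theta}_0(\epsilon)}.$$

Since $A_j^{1/\theta_j}$ follows a gamma distribution given $\theta_j$, in view of Lemma 4.9 of
van der Vaart and van Zanten (2009), for $r$ larger than a positive constant
depending only on the parameters of the gamma distribution,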

P{Aj > r^^ I 9) < Cir^^e-^^r 
Combining the above, since B contains for every 9 G Sd-i, 

FiW^ ^B) = j^l^j V{W^ i B I 9)g{si \ 9) 



< 



j^i^jv{W^iB'\9)g{^\9) 



d+l 



(5.3) < Car^^e-^^^' + e-^^''^"^^''''^ 
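From Lemma 4.8, the entropy of $B$ can be estimated as,

$$\log N(2\epsilon, B, \|\cdot\|_\infty) \le \log N\Big(\epsilon, \bigcup_{\theta\in S_{d-1}}\ \bigcup_{a\le r^\theta}(M\mathbb{H}^a_1), \|\cdot\|_\infty\Big)$$

(5.4) $\le K_2\,r\Big(\log\frac{M}{\epsilon}\Big)^{d+1}.$

Thus (5.2), (5.3) and (5.4) can be simultaneously satisfied if we choose, for
constants $\kappa, \kappa_1, \kappa_2 > 0$,

$$\epsilon_n = n^{-\alpha_0/(2\alpha_0+1)}\log^{\kappa}(n),$$
$$r_n = n^{1/(2\alpha_0+1)}\log^{\kappa_1}(n),$$
$$M_n = r_n\log^{\kappa_2}(n).$$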

5.2. Proof of Theorem 3.4. For ease of notation, we shall make the sim-
plifying assumption that the random variable $B$ is degenerate at 1. For $a > 0$
and $S \subset \{1,\ldots,d\}$, let $\mathbb{H}^{a,S}$ denote the RKHS of $W^a$, where $a_j = a$ for $j \in S$
and $a_j = 1$ for $j \notin S$.

For a subset $S \subset \{1,\ldots,d\}$ with $|S| = \tilde d$, and given positive constants
$M, r, \xi, \epsilon$, let


Bs = BsiM,r,^,e) 



+ eBi 



U U (Mmf ) + el 



Since, given S, A'^ ~ gamma, it can be shown that, for some constant 
Ci>0, 

The dominating term in the e entropy of Bs is bounded by 

While calculating the concentration probability around wq G C"^[0, 1]^, sim- 
ply use the fact that pr(S' = /) > 0. 

Combining the above, the sieves B^ are constructed as, 

d 

Bn=\J U Bs{Mlrl^n,en), 

d=l S:\S\=d 



where, for constants k, ki > 0, 



dp 



y\s\ 



= n2«+do log''i(n) 



(Mf)2 = (rf)'^log(rf/6„). 

6. Lower bounds on posterior contraction rates. In this section, 
we will demonstrate that when the true density is dependent on a smaller 
number of variables, a Gaussian process prior with a single bandwidth leads 
to a sub-optimal rate of convergence. To illustrate this, we will focus on 
the example of density estimation using the logistic Gaussian process prior. 
We will show that the posterior contraction rate using a single bandwidth
logistic Gaussian process with respect to the sup-norm topology is bounded
below by $n^{-\alpha/(2\alpha+d)}$ when the true density is

(6.1) $f_0(x_1,\ldots,x_d) = C e^{|x_1 - 0.5|^{\alpha}}, \quad x = (x_1,\ldots,x_d)^T \in [0,1]^d.$

This shows the necessity of using an inhomogeneous Gaussian process in
high-dimensional density estimation when the true density is actually lower
dimensional. Although lower bounds on the posterior contraction rates in
Gaussian process settings have been previously addressed by Castillo (2008),
the literature is restricted to series expansion priors and the Riemann-
Liouville process priors. In this section, we have extended the results to
Gaussian process with exponential covariance kernel having a single band-
width. In particular, we have derived a lower bound to the concentration
function around $w_0(x_1,\ldots,x_d) = |x_1 - 0.5|^{\alpha}$ using a single inverse-gamma
bandwidth.

In the following, we shall consider a rescaled Gaussian process $W^A$ for
a positive random variable $A$ stochastically independent of $W$. Recall that
the logistic Gaussian process prior for a density $f$ on $[0,1]^d$ is given by

(6.2) $f(x) = \dfrac{e^{W^A_x}}{\int_{[0,1]^d} e^{W^A_t}\,dt}.$

We shall consider a prior distribution on $A$ specified by $A^{d^*} \sim g$, where $g$
is the gamma density and $d^* \in \{1,\ldots,d\}$. Recall that a gamma prior on
$A^d$ results in the minimax rate of contraction adaptively over $\log f$ being
an isotropic $\alpha$-Hölder function of $d$ variables for any $\alpha > 0$. We shall show
below that the above specification involving a single bandwidth leads to a
sub-optimal rate for any choice of $d^* \in \{1,\ldots,d\}$ if $\log f_0$ depends on fewer
coordinates.

We will start with a few auxiliary lemmas which enable us to provide a
lower bound to the concentration function of the Gaussian process $W^A$. First
we derive a lower bound to the concentration function $\phi^a_{w_0}(\epsilon)$ for a fixed $a$ and
then marginalize with respect to the prior for $a$. The lower bound coupled
with the ability of the model (6.2) to identify the Gaussian process term
from $w_0$ results in a lower bound to the posterior concentration rate. The key
to obtaining a lower bound for the concentration function $\phi^a_{w_0}(\epsilon)$ is to find a
lower bound to $-\log P(\|W^a\|_\infty \le \epsilon)$. However, it is important to note here
that one cannot just obtain a lower bound to the marginalized concentration
function by marginalizing over $-\log P(\|W^a\|_\infty \le \epsilon)$. It becomes necessary
to carefully characterize the domain of $a$ in terms of the $\epsilon$ for which there
exists an element in $\mathbb{H}^a$ in an $\epsilon$-sup-norm neighborhood of $w_0$. Lemmas 6.2-
6.4 serve to find this domain by searching for the best approximator of $w_0$ in
$\mathbb{H}^a$. In conjunction with our intuition, the obtained domain is $[C_0\epsilon^{-1/\alpha},\infty)$
for some global constant $C_0$. This fact immediately provides a sharp lower
bound to the marginalized concentration function which turns out to be of
the same order as the upper bound up to a log-factor. Thus it is of no surprise
that one can only achieve a sub-optimal rate of posterior convergence using
a single bandwidth logistic Gaussian process prior.

Denote by $\mathbb{H}^a$ the reproducing kernel Hilbert space of the Gaussian pro-
cess $W^a$. In the following, we define a Gaussian based higher order kernel
as in Wand and Schucany (1990). For $r \ge 1$, let $Q_{2r-2}$ be the polynomial
given by $Q_{2r-2}(x) = \sum_{i=0}^{r-1} c_{2i}x^{2i}$ where

$$c_{2i} = \frac{(-1)^i\,2^{i-2r+1}\,(2r)!}{r!\,(2i+1)!\,(r-i-1)!}.$$

Wand and Schucany (1990) showed that $Q_{2r-2}$ is the unique polynomial of
degree $\le 2r-2$ for which $G_{2r} = Q_{2r-2}\phi$ is a $2r$ order kernel. It is easy to
see that $r = 1$ corresponds to the standard Gaussian kernel. For $r \ge 1$ and
any $1 \le j \le r-1$, $\int x^{2j}G_{2r}(x)\,dx = 0$.

For $x \in \mathbb{R}^d$, define $\psi^{2r}(x) = G_{2r}(x_1)\cdots G_{2r}(x_d)$ and for $a > 0$, let
$\psi^{2r}_a(x) = a^d\psi^{2r}(ax)$.
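In the following Lemma 6.1, we calculate the Fourier transform of $\psi^{2r}(t)$.

Lemma 6.1. $\hat\psi^{2r}(\lambda) = e^{-\|\lambda\|^2/2}\prod_{j=1}^d\Big\{\sum_{s=0}^{r-1}\dfrac{\lambda_j^{2s}}{2^s s!}\Big\}.$

Proof.

$$\hat\psi^{2r}(\lambda) = \int e^{i(\lambda,t)}\psi^{2r}(t)\,dt
= \int e^{i(\lambda,t)}G_{2r}(t_1)\cdots G_{2r}(t_d)\,dt
= \prod_{j=1}^d\int e^{i\lambda_j t_j}G_{2r}(t_j)\,dt_j$$
$$= \prod_{j=1}^d e^{-\lambda_j^2/2}\sum_{s=0}^{r-1}\frac{\lambda_j^{2s}}{2^s s!}
= e^{-\|\lambda\|^2/2}\prod_{j=1}^d\sum_{s=0}^{r-1}\frac{\lambda_j^{2s}}{2^s s!},$$

where the penultimate identity follows from Wand and Schucany (1990). □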

Lemma 4.1 of van der Vaart and van Zanten (2009) gives a nice characterization
of $\mathbb{H}^a$ in view of the isometry with the space $L_2(f_a)$. In the following
Lemma 6.2, we express each element of $\mathbb{H}^a$ as a convolution of $\psi_{2a}^{2r}$ with a
function in $C(\mathbb{R}^d)$ for any given $r \ge 1$. In other words, every element of $\mathbb{H}^a$
arises as a convolution of a higher order kernel with a function in $C(\mathbb{R}^d)$,
showing that the search for the best approximator of a $C^\alpha[0,1]^d$ function in
the space $\mathbb{H}^a$ can be restricted to only convolutions of continuous functions
with a higher order kernel.



Lemma 6.2. Given any $h \in \mathbb{H}^a$ and $r \ge 1$, there exists $w \in C(\mathbb{R}^d)$ such
that $h = \psi_{2a}^{2r} * w$.



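Proof. By Lemma 4.1 of van der Vaart and van Zanten (2009), we obtain
that any $h \in \mathbb{H}^a$ can be written as

$$t \mapsto \int e^{i(\lambda, t)} g(\lambda) f_a(\lambda)\,d\lambda, \tag{6.3}$$

where $\int |g(\lambda)|^2 f_a(\lambda)\,d\lambda < \infty$.
By change of variable,

$$h(t) = \int e^{-i(\lambda, t)} g(-\lambda) f_a(\lambda)\,d\lambda \tag{6.4}$$

with $\int |g(-\lambda)|^2 f_a(\lambda)\,d\lambda < \infty$. Then $\hat h(\lambda) = (2\pi)^d g(-\lambda) f_a(\lambda)$. Now observe
that $\hat\psi^{2r}(\lambda)$ is real and positive for all values of $\lambda$ and $\hat\psi^{2r}(\lambda) \ge e^{-\|\lambda\|^2/2}$.
Also note that $\hat\psi_{2a}^{2r}(\lambda) = \hat\psi^{2r}(\lambda/2a)$. Hence setting $\hat w(\lambda) = \hat h(\lambda)/\hat\psi_{2a}^{2r}(\lambda)$, we obtain

$$\hat w(\lambda) = \frac{g(-\lambda)\, \pi^{d/2} \exp\{-\|\lambda\|^2/4a^2\}}{\exp\{-\|\lambda\|^2/8a^2\} \prod_{j=1}^{d} \sum_{s=0}^{r-1} \frac{(\lambda_j/2a)^{2s}}{2^s s!}}.$$

Thus $|\hat w(\lambda)| \le \exp\{-\|\lambda\|^2/8a^2\}\, |g(-\lambda)|$ and

$$\begin{aligned}
\left( \int |\hat w(\lambda)|\,d\lambda \right)^2 &\le \left( \int \exp\{-\|\lambda\|^2/8a^2\}\, |g(-\lambda)|\,d\lambda \right)^2 \\
&\le \int \exp\{-\|\lambda\|^2/4a^2\}\, |g(-\lambda)|^2\,d\lambda \\
&< \infty.
\end{aligned}$$

As $\hat w$ belongs to $L_1$, and $\hat h = \hat\psi_{2a}^{2r} \hat w$, we immediately have $h = \psi_{2a}^{2r} * w$ for a
continuous function $w$ given by

$$\begin{aligned}
w(t) &= \frac{1}{(2\pi)^d} \int e^{-i(\lambda, t)} \hat w(\lambda)\,d\lambda \\
&= \frac{1}{(2\pi)^d} \int e^{-i(\lambda, t)} \frac{g(-\lambda)\, \pi^{d/2} \exp\{-\|\lambda\|^2/4a^2\}}{\exp\{-\|\lambda\|^2/8a^2\} \prod_{j=1}^{d} \sum_{s=0}^{r-1} \frac{(\lambda_j/2a)^{2s}}{2^s s!}}\,d\lambda.
\end{aligned}$$

□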
The following Lemma 6.3 says that $\psi_a^{2r} * w_0$ can better approximate $w_0 \in
C(\mathbb{R}^d)$ compared to $\psi_a^{2r} * w$ for any $w \ne w_0$. Lemma 6.3 further restricts the
search for the best approximator of a $C(\mathbb{R}^d)$ function to only convolutions
of the higher order kernel $\psi_a^{2r}$ with the function $w_0$ itself.

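Lemma 6.3. Given any $w_0 \in C(\mathbb{R}^d)$ compactly supported and $r \ge 1$,

$$\|\psi_a^{2r} * w - w_0\|_\infty \ge \|\psi_a^{2r} * w_0 - w_0\|_\infty$$

for sufficiently large $a > 0$ and for any $w \in C(\mathbb{R}^d)$ compactly supported with
$\|w - w_0\|_\infty > \delta$ for some $\delta > 0$.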

Proof. Note that

$$\|\psi_a^{2r} * w - w_0\|_\infty \ge \|w - w_0\|_\infty - \|\psi_a^{2r} * w - w\|_\infty.$$

Since $w$ is compactly supported, there exists $a_0 > 0$ such that for $a > a_0$,
$\|\psi_a^{2r} * w - w\|_\infty < \delta/2$. The conclusion of the lemma follows by observing
that for $a > a_0$, $\|\psi_a^{2r} * w_0 - w_0\|_\infty \le \delta/2$. □

The following Lemma 6.4 provides a lower bound to the approximation
error for $w_0(x_1, \ldots, x_d) = |x_1 - 0.5|^{1.5}$, $(x_1, \ldots, x_d) \in [0,1]^d$, with $\psi_{2a}^{2r} * w_0$.

Lemma 6.4. For $w_0(x_1, \ldots, x_d) = |x_1 - 0.5|^{1.5}$,

$$\|w_0 - \psi_{2a}^{2r} * w_0\|_\infty \ge C_0 a^{-1.5} \tag{6.5}$$

for some global constant $C_0 > 0$.

Proof. Since $w_0 \in C^{1.5}[0,1]^d$, by Whitney's theorem we can extend it
to $\mathbb{R}^d$ so that $w_0$ has a compact support with $\|w_0\|_{1.5} < \infty$. Without loss of
generality, assume $w_0$ is non-negative and the support of $w_0$ is $[-L, L]^d$ for
some large $L$. Observe that

$$\psi_{2a}^{2r} * w_0(1/2) - w_0(1/2) = \int \psi^{2r}(s)\, w_0(1/2 - s/(2a))\,ds.$$

Now since $w_0(1/2 - s/(2a)) = 0$ if $|1/2 - s/(2a)| > L$, so for $a > 1/2$,
$\{s : |1/2 - s/(2a)| \le L\} \supset [-2L+1, 2L+1]^d$. Thus

$$\begin{aligned}
\int \psi^{2r}(s)\, w_0(1/2 - s/(2a))\,ds &\ge \int_{[-2L+1, 2L+1]^d} \psi^{2r}(s)\, w_0(1/2 - s/(2a))\,ds \\
&= \frac{1}{(2a)^{1.5}} \int_{[-2L+1, 2L+1]^d} \psi^{2r}(s)\, |s_1|^{1.5}\,ds.
\end{aligned}$$

This shows that $\|w_0 - \psi_{2a}^{2r} * w_0\|_\infty \ge C_0 a^{-1.5}$ where

$$C_0 = 2^{-1.5} \int_{[-2L+1, 2L+1]^d} \psi^{2r}(s)\, |s_1|^{1.5}\,ds.$$

Also it follows from the last part of Lemma 4.3 of van der Vaart and van Zanten
(2009) that $\psi_{2a}^{2r} * w_0 \in \mathbb{H}^a$ since $\{\hat\psi_{2a}^{2r}\}^2(\lambda) = f_a(\lambda)$. □

Note that the lower bound obtained is the same as the upper bound to the
approximation error of any $C^{1.5}[0,1]^d$ function using $\psi_{2a}^{2r} * w$, up to constants.
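The $a^{-1.5}$ scaling in Lemma 6.4 can be illustrated numerically in the simplest case. The sketch below is only indicative and rests on simplifying assumptions that are not in the text: it takes $d = 1$ and $r = 1$ (so the kernel is the standard Gaussian density), uses $w_0(x) = |x - 0.5|^{1.5}$ on the whole real line instead of its compactly supported extension, and the function name and quadrature settings are hypothetical.

```python
import math


def smoothing_error_at_half(a, half_width=12.0, step=1e-3):
    """Trapezoid value of int phi(s) * |s / (2a)|^1.5 ds, which equals
    (psi_2a * w0)(1/2) - w0(1/2) for d = 1, r = 1 and w0(x) = |x - 0.5|^1.5."""
    n = int(round(2 * half_width / step))
    total = 0.0
    for k in range(n + 1):
        s = -half_width + k * step
        weight = 0.5 if k in (0, n) else 1.0
        phi = math.exp(-0.5 * s * s) / math.sqrt(2.0 * math.pi)
        total += weight * phi * abs(s / (2.0 * a)) ** 1.5
    return total * step


# E|Z|^1.5 for Z ~ N(0, 1), from the absolute-moment formula
# 2^(p/2) * Gamma((p + 1) / 2) / sqrt(pi) with p = 1.5.
ABS_MOMENT = 2.0 ** 0.75 * math.gamma(1.25) / math.sqrt(math.pi)
```

Under these assumptions the error at $x = 1/2$ equals $(2a)^{-1.5} E|Z|^{1.5}$, so doubling the bandwidth parameter $a$ shrinks it by the factor $2^{1.5}$ and no faster.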

The following Lemma 6.5 is crucial to the derivation of a lower bound to
the concentration function $\phi_{w_0}^a$. Lemma 6.5 complements Lemma 4.6 of
van der Vaart and van Zanten (2009) and is an application of Theorem 2 of
Kuelbs and Li (1993).

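Lemma 6.5. There exists $\epsilon_0 > 0$, possibly depending on $a$, such that for
all $\epsilon < \epsilon_0$,

$$-\log P(\|W^a\|_\infty < \epsilon) \gtrsim a^d \left( \log \frac{|\log \epsilon|^{1/2}}{\epsilon} \right)^{d+1}. \tag{6.6}$$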

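Proof. Obtaining a lower bound is a simple application of Lemma 4.5
of van der Vaart and van Zanten (2009) and Theorem 2 of Kuelbs and Li
(1993). The proof of Lemma 4.3 of van der Vaart and van Zanten (2009)
shows that

$$\log N(\epsilon, \mathbb{H}_1^a, \|\cdot\|_\infty) \asymp a^d \left( \log \frac{1}{\epsilon} \right)^{d+1}.$$

If we define $g_a(x) = a^d (\log x)^{d+1}$, it is easy to observe that $g_a$ is a slowly
varying function. Then by Theorem 2 of Kuelbs and Li (1993), we obtain

$$\phi_0^a(\epsilon) \ge C\, g_a\!\left( \frac{\phi_0^a(\epsilon)^{1/2}}{\epsilon} \right) = C a^d \left( \log \frac{\phi_0^a(\epsilon)^{1/2}}{\epsilon} \right)^{d+1}. \tag{6.7}$$

Below we show that we only need to find a crude lower bound to $\phi_0^a(\epsilon)$ to
obtain the required bound. Observe that

$$\phi_0^a(\epsilon) = -\log P(\|W^a\|_\infty < \epsilon) \ge -\log P(|W_0^a| < \epsilon). \tag{6.8}$$

Note that $W_0^a \sim N(0,1)$ and hence $P(|W_0^a| < \epsilon) = 2\Phi(\epsilon) - 1 \asymp \epsilon$
as $\epsilon \to 0$, so that $-\log P(|W_0^a| < \epsilon) \asymp |\log \epsilon|$. Hence we obtain for sufficiently small $\epsilon$,

$$\phi_0^a(\epsilon) \gtrsim |\log \epsilon|. \tag{6.9}$$

Plugging in the bound (6.9) in (6.7), we obtain

$$\phi_0^a(\epsilon) \gtrsim a^d \left( \log \frac{|\log \epsilon|^{1/2}}{\epsilon} \right)^{d+1}. \tag{6.10}$$

□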

Note that the lower bound in Lemma 6.5 differs from the upper bound in 
Lemma 4.6 of van der Vaart and van Zanten (2009) only by a logarithmic 
factor suggesting that the lower bound obtained is reasonably tight. 
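The crude bound (6.9) used in the proof of Lemma 6.5 rests on the one-point probability $P(|W_0^a| < \epsilon) = 2\Phi(\epsilon) - 1$. A short numerical illustration follows; it is a sketch with hypothetical function names, using only the identity $2\Phi(\epsilon) - 1 = \mathrm{erf}(\epsilon/\sqrt{2})$.

```python
import math


def one_point_small_ball(eps):
    """P(|Z| < eps) for Z ~ N(0, 1), i.e. 2 * Phi(eps) - 1 = erf(eps / sqrt(2))."""
    return math.erf(eps / math.sqrt(2.0))


def crude_bound_ratio(eps):
    """Ratio of -log P(|Z| < eps) to |log eps|; it approaches 1 as eps -> 0."""
    return -math.log(one_point_small_ball(eps)) / abs(math.log(eps))
```

Evaluating `crude_bound_ratio` at decreasing values of `eps` shows the ratio settling just above one, which is all that (6.9) requires.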

Finally, we calculate the tail probability of the supremum of the Gaussian
process $W^A$, which will be crucially used to derive a lower bound to
the posterior concentration rate. Although this is an application of Borell's
inequality, we provide an independent proof to carefully identify the
role of the prior for the bandwidth.

Lemma 6.6. For $r > 1$,

$$P(\|W^A\|_\infty > M) \le P(A > r) + 2(rM)^d \exp\left[ -\tfrac{1}{2} M^2 + C\{(\log r)^{1/2} + (\log M)^{1/2}\} \right]$$

for some constant $C > 0$.



Proof. From Theorem 5.2 of Adler (1990) it follows that if $X$ is a centered
Gaussian process on a compact set $T \subset \mathbb{R}^d$ and $\sigma_T^2$ is the maximum
variance attained by the Gaussian process on $T$, then for large $M$,

$$P(\|X\|_\infty > M) \le 2 N(1/M, T, \|\cdot\|) \exp\left[ -\frac{1}{2\sigma_T^2} \{M - \nu(M)\}^2 \right],$$

where $\nu(M) = C_1 \int_0^{1/M} \{\log N(1/M, T, \|\cdot\|)\}^{1/2}\,d(1/M)$ for some constant
$C_1 > 0$. Observe that $W^a$ is rescaled to $T = [0, a]^d$ and the maximum
variance attained by $W^a$ is 1. Note that $N(1/M, T, \|\cdot\|) = (aM)^d$. Now

$$\begin{aligned}
\nu(M) &\le C_2 \int_0^{1/M} \{d \log(aM)\}^{1/2}\,d(1/M) \\
&\le C_3 \int_0^{1/M} \{(\log a)^{1/2} + (\log M)^{1/2}\}\,d(1/M) \\
&\le C_3 \frac{1}{M} \{(\log a)^{1/2} + (\log M)^{1/2}\}
\end{aligned}$$

for some constants $C_2, C_3 > 0$. Using $W^a$ in place of $X$, we obtain

$$P(\|W^a\|_\infty > M) \le 2(aM)^d \exp\left[ -\tfrac{1}{2} M^2 + C_3\{(\log a)^{1/2} + (\log M)^{1/2}\} \right].$$

The conclusion of the lemma follows immediately. □

7. Main result. Below we state the main theorem on obtaining a lower
bound to the posterior concentration rate using a logistic Gaussian process
prior when the true density is given by (6.1). Since $w_0$ is a $C^{1.5}[0,1]^d$ function,
the best obtainable upper bound to the posterior rate of convergence using a
single bandwidth logistic Gaussian process prior is $n^{-1.5/(3+d)} = n^{-3/(6+2d)}$
up to a log factor (van der Vaart and van Zanten, 2009). In the following
Theorem 7.1, we show that the lower bound using the sup-norm topology
is also of the same order if we use a single bandwidth. In other words, it is
impossible for a single bandwidth Gaussian process to optimally learn the
lower dimensional density.
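To make the gap concrete, the exponents can be tabulated in exact arithmetic. The sketch below is illustrative: the function names are hypothetical, and it assumes that the relevant benchmark for a surface depending on a single coordinate is the exponent $\alpha/(2\alpha + d)$ of Stone (1982) evaluated at $d = 1$.

```python
from fractions import Fraction


def minimax_exponent(alpha, dim):
    """Exponent alpha / (2 * alpha + dim) in the rate n^(-alpha / (2 * alpha + dim))."""
    alpha = Fraction(alpha)
    return alpha / (2 * alpha + dim)


def rate_table(alpha, dims):
    """(d, single-bandwidth exponent, one-dimensional exponent) for each ambient dimension d."""
    return [(d, minimax_exponent(alpha, d), minimax_exponent(alpha, 1)) for d in dims]


for d, single, oracle in rate_table(Fraction(3, 2), range(1, 6)):
    print(f"d={d}: single bandwidth n^-{single}, one-dimensional benchmark n^-{oracle}")
```

With $\alpha = 1.5$ the single-bandwidth exponent is $3/(6+2d)$, which coincides with the one-dimensional value only at $d = 1$ and decreases as the ambient dimension grows.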

Theorem 7.1. If $f_0$ is given by (6.1) and the prior for a density $f$ on
$[0,1]^d$ is given as in (6.2) for any $d^* \in \{1, \ldots, d\}$, then

$$P(\|f - f_0\|_\infty \le n^{-3/(6+2d)} \log^{t_0} n \mid Y_1, \ldots, Y_n) \to 0 \tag{7.1}$$

a.s. as $n \to \infty$ for some constant $t_0 > 0$.

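Proof. To obtain the lower bound, we will verify the conditions of
Lemma 1 in Castillo (2008) with $S_n = \{f : \|f - f_0\|_\infty \le \epsilon_n\}$ for $\epsilon_n =
n^{-3/(6+2d)} \log^{t_0} n$, with some constant $t_0$ chosen appropriately in the
subsequent analysis. From the proof of Lemma 5 in Castillo (2008) it follows that
for $c_k = k d \epsilon_n$, $k = -N, \ldots, N$, and $N$ the smallest integer larger than $C\sqrt{n}$,

$$P(\|f - f_0\|_\infty \le \epsilon_n) \le \sum_{k=-N}^{N} P(\|W^A - w_0 - c_k\|_\infty \le 2d\epsilon_n) + P(\|W^A\|_\infty > C\sqrt{n}\,\epsilon_n). \tag{7.2}$$

An application of Lemma 6.6 with $M^2, r^d = O(n\epsilon_n^2)$ yields

$$\begin{aligned}
P(\|W^A\|_\infty > C\sqrt{n}\,\epsilon_n) &\le P(A > r_n) + \exp\{-K_1 M_n^2\} \\
&\le \exp\{-r_n^d\} + \exp\{-K_1 M_n^2\} \\
&\le \exp\{-K_2 n\epsilon_n^2\},
\end{aligned} \tag{7.3}$$

for some constants $K_1, K_2 > 0$.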

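Lemmas 6.2-6.4 and the observation that $w_0 \notin C^{1.5+\delta}[0,1]^d$ for any $\delta > 0$
together imply that given any $\epsilon > 0$, there does not exist any element $h$ in
$\mathbb{H}^a$ for $a < C_0 \epsilon^{-1/\alpha}$ such that for each $k = -N, \ldots, N$,

$$\|w_0 - h - c_k\|_\infty \le \epsilon,$$

where $w_0$ is given by $w_0(x_1, \ldots, x_d) = |x_1 - 0.5|^{1.5}$. From Lemma 6.5, if
$a \ge C_0 \epsilon^{-1/\alpha}$,

$$\inf_{h \in \mathbb{H}^a : \|h - w_0 - c_k\|_\infty < \epsilon} \tfrac{1}{2} \|h\|_{\mathbb{H}^a}^2 - \log P(\|W^a\|_\infty < \epsilon) \gtrsim a^d \left( \log \frac{|\log \epsilon|^{1/2}}{\epsilon} \right)^{d+1}.$$

Hence for $k = -N, \ldots, N$,

$$P(\|W^A - w_0 - c_k\|_\infty < \epsilon) \le \int_{C_0 \epsilon^{-1/\alpha}}^{\infty} \exp\left\{ -a^d \left( \log \frac{|\log \epsilon|^{1/2}}{\epsilon} \right)^{d+1} \right\} da.$$

Using the inequality

$$\int_v^\infty \exp\{-t^r\}\,dt \le 2 r^{-1} v^{1-r} \exp\{-v^r\},$$

we obtain that

$$P(\|W^A - w_0 - c_k\|_\infty < \epsilon) \le C_1 \exp\{-C_2 \epsilon^{-d/\alpha} |\log \epsilon|^{d+1}\},$$

for some constants $C_1, C_2 > 0$. Thus, from (7.3) and (7.2),

$$P(\|f - f_0\|_\infty \le \epsilon_n) \le C_3 N \exp\{-C_4 \epsilon_n^{-d/\alpha}\}, \tag{7.4}$$

for some constants $C_3, C_4 > 0$. From van der Vaart and van Zanten (2009) it
also follows that

$$P(B_{KL}(f_0, \epsilon_n)) \ge e^{-C_5 n \epsilon_n^2}, \tag{7.5}$$

for some constants $t_0 > 0$ and $C_5 > 0$, where

$$B_{KL}(f_0, \epsilon) = \left\{ f : \int f_0 \log \frac{f_0}{f} \le \epsilon^2, \ \int f_0 \left( \log \frac{f_0}{f} \right)^2 \le \epsilon^2 \right\}. \tag{7.6}$$

By adjusting $t_0$, $C_4$ and $C_5$, we have from (7.4) and (7.5)

$$\frac{P(\|f - f_0\|_\infty \le \epsilon_n)}{P(B_{KL}(f_0, \epsilon_n))} \le \exp\{-2n\epsilon_n^2\},$$

which proves the assertion of the theorem by Lemma 1 of Castillo (2008). □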
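The tail-integral inequality $\int_v^\infty \exp\{-t^r\}\,dt \le 2 r^{-1} v^{1-r} \exp\{-v^r\}$ invoked in the proof can be checked numerically for a few values. This is a rough sketch rather than a proof: the function names, the truncation of the integral at $v + 20$ and the restriction to $r \ge 1$ are assumptions made here.

```python
import math


def tail_integral(v, r, length=20.0, step=1e-3):
    """Trapezoid value of int_v^(v + length) exp(-t^r) dt.
    For r >= 1 the part of the tail beyond v + length is negligible."""
    n = int(round(length / step))
    total = 0.0
    for k in range(n + 1):
        t = v + k * step
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * math.exp(-t ** r)
    return total * step


def tail_bound(v, r):
    """Right-hand side 2 * r^(-1) * v^(1 - r) * exp(-v^r) of the inequality."""
    return 2.0 / r * v ** (1 - r) * math.exp(-v ** r)
```

For $r \ge 1$ the bound in fact holds with constant one in place of two, because $(t/v)^{r-1} \ge 1$ on the range of integration; the factor two leaves room for the remaining cases at large $v$.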
Remark 7.2. Note that the lower bound $n^{-3/(6+2d)} \log^{t_0} n$ for $d > 1$
is only a sub-optimal rate for estimating $w_0$, the optimal rate being given
by $n^{-3/8}$, which is actually achieved by a multi-bandwidth Gaussian process
prior. Refer to Theorem 3.7 for details.

Remark 7.3. Note that we have derived a lower bound to the posterior
contraction rate only for the special choice of $f_0$ given in (6.1). The choice
is motivated by the fact that it is easy to find a lower bound to the best
approximation error of this function within the class $\mathbb{H}^a$. More generally,
one might be interested in finding a subset of $C^\alpha[0,1]^d$ for a fixed $\alpha > 0$
such that we can characterize both the best approximator and a lower bound
to the approximation error for each of the elements in the subset. This would
require a different version of Lemma 6.4 in each of the cases. However, the
general recipe provided in Lemmas 6.2-6.4 remains the same.

Remark 7.4. One can also obtain a lower bound to the posterior
concentration rate in other statistical settings, e.g., Gaussian process mean
regression, using the same technique. This would need a careful
characterization of the upper bound to the concentration probability of the induced density
around the truth, i.e., $P(\|f - f_0\|_\infty < \epsilon_n)$, in terms of the concentration
probability of the Gaussian process around $w_0$, similar to that for the logistic
Gaussian process in Theorem 7.1. Interested readers might find an outline
of such an exercise in Section 7.7 of Ghosal and van der Vaart (2007).